Verifiable Reasoning: LLMs as Code Generators for Quant VC Founder Screening
What Verifiable Reasoning contributes to quant VC
Verifiable Reasoning reframes how large language models are used inside quant VC founder screening. Instead of calling an LLM once per founder to evaluate each rule, the paper prompts the LLM once to emit executable Python code, and then runs that code deterministically across the entire founder dataset. The result is a quant VC screening pipeline that is cheaper, faster, reproducible, and auditable, without giving up the interpretability that quant VC firms require to justify their decisions to LPs and partners.
On VCBench, Vela's standardized quant VC benchmark of 4,500 anonymized founders with a 9% base rate of success, Verifiable Reasoning achieves 37.5% precision and an F0.5 score of 25.0%, a 4.17x lift over random indexing. It beats GPT-4o on precision (30.0%, F0.5 of 25.7%) while ensuring every prediction traces back to human-readable founder attributes that a quant VC partner can inspect and edit.
What is quant VC, and where does Verifiable Reasoning fit?
Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. Quant VC treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved with the same rigor that quantitative finance brings to credit risk or that quantitative medicine brings to diagnostic screening. Quant VC requires quantitative scoring against honest baselines, reproducible methodology, and interpretability that allows every prediction to be audited.
Verifiable Reasoning sits inside Vela's LLM-Augmented ML research thread. That thread treats the LLM as a component inside a classical machine learning pipeline rather than as the reasoning substrate itself. Where Vela's Think-Reason-Learn family teaches LLMs to reason like investors through decision trees, rule forests, and policy induction, the LLM-Augmented ML family uses LLMs to enrich classical ML: as feature generators, as symbolic rule writers, as persona clusterers, and in this paper, as code generators. This architectural choice matters for quant VC because it decouples the scaling cost of screening from the pricing of frontier LLM APIs.
How does Verifiable Reasoning turn LLMs into quant VC code generators?
Rather than asking the LLM “did this founder scale a company past 200 employees” 4,500 times, Verifiable Reasoning asks the LLM once to write the Python predicate that checks this condition, and then runs the predicate against all 4,500 structured founder profiles. The pipeline has four stages.
Rule generation with stratified sampling. Each iteration sends a single prompt containing the founder data schema, helper functions for parsing QS rankings, tenure strings, and company size ranges, and 120 stratified founder profiles (60 successful and 60 unsuccessful). This balanced sampling corrects for the 9% base rate so the LLM does not generate rules that trivially predict failure. The LLM emits 30 candidate rules as Python lambda expressions, each with a name, a description, and an executable body.
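A minimal sketch of what one LLM-emitted candidate rule might look like, assuming a dict-based founder profile; the field names (qs_rank, num_exits) and the profile values are illustrative, not taken from the paper's actual schema:

```python
# Illustrative shape of one LLM-emitted candidate rule: a name, a
# human-readable description, and an executable Python predicate over
# a structured founder profile. Field names here are hypothetical.
candidate_rule = {
    "name": "elite_school_and_exit",
    "description": "Attended a top-50 QS-ranked university and has >= 1 prior exit",
    "predicate": lambda f: (
        f.get("qs_rank") is not None
        and f["qs_rank"] <= 50
        and f.get("num_exits", 0) >= 1
    ),
}

# A structured founder profile the predicate can run against.
founder = {"qs_rank": 12, "num_exits": 1, "max_company_size": 350}
fired = candidate_rule["predicate"](founder)  # True for this profile
```

Because the rule body is plain Python, a partner can read the predicate directly and see exactly what condition it encodes.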
Deterministic evaluation. Every rule is compiled to a callable and executed across the full founder dataset with no further LLM involvement. A rule like elite_school_and_exit, which checks whether a founder attended a top-50 QS-ranked university and has at least one prior exit, evaluates identically across runs. This is where the 99% cost reduction over per-sample LLM rule evaluation comes from, and where stochastic hallucination risk is eliminated at inference time.
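The deterministic evaluation stage can be sketched as a single pass that builds a boolean activation matrix, one row per founder and one column per rule; the rules and profiles below are toy examples, not the paper's actual rule set:

```python
# Sketch of deterministic rule evaluation: compiled rules run over the
# full dataset with no LLM in the loop, so the activation matrix is
# identical across runs. Rule and field names are illustrative.
rules = {
    "has_prior_exit": lambda f: f.get("num_exits", 0) >= 1,
    "scaled_past_200": lambda f: f.get("max_company_size", 0) > 200,
}

founders = [
    {"num_exits": 1, "max_company_size": 500},
    {"num_exits": 0, "max_company_size": 50},
    {"num_exits": 2, "max_company_size": 30},
]

# Activation matrix: one boolean row per founder, one column per rule.
activations = [
    {name: bool(rule(f)) for name, rule in rules.items()}
    for f in founders
]
```

Evaluating 30 rules over 4,500 profiles this way is a cheap in-memory loop, which is where the cost reduction over per-sample LLM querying comes from.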
Statistical validation. Each rule is filtered by three thresholds that matter for quant VC screening: precision lift greater than 1.0 (the rule identifies above-baseline founders), one-sided binomial p-value below 0.05 (the lift is statistically significant), and coverage between 1% and 50% (the rule has enough power to test but is not so broad that it captures generic attributes). In Iteration 1, 19 of 30 rules pass validation. In Iteration 2, 21 of 30 pass with a higher mean precision lift of 2.58x versus 1.70x.
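The three validation filters can be expressed compactly; this is a sketch, with the one-sided binomial tail computed directly from the base rate (405 successes out of 4,500 matches the 9% prevalence stated above, and the example rule's hit counts are hypothetical):

```python
from math import comb

def validate_rule(hits, hit_successes, n_total, n_successes):
    """Apply the three quant VC filters to one candidate rule (sketch).

    hits: founders where the rule fires; hit_successes: how many of
    those succeeded; n_total / n_successes: dataset totals.
    """
    base_rate = n_successes / n_total
    precision = hit_successes / hits
    lift = precision / base_rate
    coverage = hits / n_total
    # One-sided binomial tail: P(X >= hit_successes) under the base rate.
    p_value = sum(
        comb(hits, k) * base_rate**k * (1 - base_rate) ** (hits - k)
        for k in range(hit_successes, hits + 1)
    )
    passes = lift > 1.0 and p_value < 0.05 and 0.01 <= coverage <= 0.50
    return lift, p_value, coverage, passes

# Hypothetical rule firing on 200 of 4,500 founders, 40 of them successful.
lift, p, cov, ok = validate_rule(
    hits=200, hit_successes=40, n_total=4500, n_successes=405
)
```

In this example the rule shows a 2.22x lift at roughly 4.4% coverage with a vanishingly small p-value, so it would pass all three filters.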
Cluster-based gap analysis. After validation, founders are clustered by their rule activation patterns using K-means. Mixed clusters (7% to 12% success rate with more than 400 founders) and low-success clusters with rare winners are flagged. Gap analysis reveals that 68% of founders fall into mixed clusters where Iteration 1 rules fail to discriminate, and 57% of successful founders are not covered by the top rules. This structured feedback, rather than human annotation, guides the next iteration of rule generation.
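The gap-analysis step can be illustrated with a simplified stand-in for K-means: grouping founders by their exact rule-activation pattern and flagging groups whose success rate sits in the mixed band. The 7% to 12% band comes from the text; the toy data and the exact-pattern grouping (rather than true K-means centroids) are simplifications:

```python
from collections import defaultdict

def flag_mixed_clusters(activation_rows, outcomes, low=0.07, high=0.12):
    """Flag activation-pattern groups whose success rate is near the base
    rate, i.e. where the current rules fail to discriminate (sketch)."""
    clusters = defaultdict(list)
    for row, outcome in zip(activation_rows, outcomes):
        clusters[tuple(row)].append(outcome)
    mixed = []
    for pattern, members in clusters.items():
        rate = sum(members) / len(members)
        if low <= rate <= high:
            mixed.append((pattern, rate, len(members)))
    return mixed

# Toy data: 2 rules, 20 founders (1 = success).
rows = [(1, 0)] * 10 + [(0, 0)] * 10
outs = [1] + [0] * 9 + [0] * 10   # first group: 10% success rate
mixed = flag_mixed_clusters(rows, outs)
```

The flagged patterns (plus rare winners in low-success groups) become the structured feedback fed into the next round of rule generation.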
How accurate is Verifiable Reasoning?
The evaluation runs on VCBench, the world's first AGI benchmark for venture capital, which provides 4,500 anonymized founder profiles with a 9% success prevalence. Success is defined as an IPO, an acquisition, or a funding round above $500M. The downstream classifier is a logistic regression trained on validated rule activations with balanced class weights and 5-fold stratified cross-validation, threshold-tuned for F0.5.
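The paper's classifier is a logistic regression with balanced class weights under 5-fold stratified cross-validation; the sketch below shows only the final threshold-tuning step, selecting the decision cutoff that maximizes F0.5 on held-out scores (the score and label values are toy data):

```python
def fbeta(precision, recall, beta=0.5):
    """F-beta score; beta=0.5 weights precision twice as heavily as recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def tune_threshold(scores, labels, beta=0.5):
    """Pick the classifier decision threshold that maximizes F-beta (sketch)."""
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = fbeta(precision, recall, beta)
        if f > best_f:
            best_f, best_t = f, t
    return best_t, best_f

# Toy held-out scores: tuning favors a high-precision cutoff.
best_t, best_f = tune_threshold([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0])
```

Tuning for F0.5 rather than F1 is the right objective for rare-event screening, where a false positive costs partner diligence time.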
Headline numbers from the paper:
- Verifiable Reasoning Iteration 2 achieves 37.5% precision and an F0.5 score of 25.0%, a 4.17x lift over the 9% VCBench baseline.
- Iteration 1 already reaches 30.6% precision and an F0.5 score of 27.7%, matching GPT-4o's precision with higher F0.5.
- Iteration 2 discovers high-lift rules that Iteration 1 misses, including cybersecurity_domain_depth at 6.54x lift, ipo_only_exit at 4.76x, and repeat_exit_founder at 4.37x.
- The pipeline requires one LLM call per iteration for rule generation, versus the O(rules x founders) calls required by per-sample LLM rule evaluation.
Baseline comparisons on the same VCBench dataset:
- Random indexing: 9.0% precision.
- Y Combinator: 14.0% precision, F0.5 of 8.6%.
- Tier-1 VCs: 23.0% precision, F0.5 of 10.7%.
- Claude-3.5-Haiku: 15.8% precision, F0.5 of 18.2%.
- GPT-4o: 30.0% precision, F0.5 of 25.7%.
- o3: 43.2% precision, F0.5 of 21.5%.
- Verifiable Reasoning (Iteration 2): 37.5% precision, F0.5 of 25.0%.
At Vela's real-world founder-screening prevalence of 1.9%, the Think-Reason-Learn and LLM-Augmented ML families in production reach 19% to 38% precision, a 10x to 20x lift over the 1.9% US unicorn base rate. Verifiable Reasoning contributes an auditable, code-generated layer to that stack.
Why code generation matters for quant VC
A quant VC firm that screens tens of thousands of founders a year cannot afford per-sample LLM evaluation. The cost scales linearly with dataset size, the answers drift across runs because of sampling temperature, and there is no way to inspect why a given query returned a given verdict. Verifiable Reasoning solves all three at once: one API call per iteration, deterministic evaluation at scale, and every rule visible as a few lines of Python that a non-technical partner can read and edit.
This is the quant VC version of the same architectural argument that makes compiled code preferable to interpreted natural language in any production system. The LLM is the most expressive part of the pipeline, so it gets used where expressivity matters (writing the rule), not where determinism matters (evaluating the rule on 4,500 founders).
What makes Verifiable Reasoning auditable for quant VC decisions
Every prediction from Verifiable Reasoning decomposes into a list of boolean rules that fired for a given founder, each with its own precision lift, coverage, and p-value. A quant VC partner can ask “why did this founder score high” and get back a sentence like “this founder matches repeat_exit_founder (4.37x lift), elite_qs_top10 (2.18x lift), and founder_scaled_company_200plus (2.80x lift)”. The partner can disagree with a rule, rewrite its Python in a few lines, and rerun the pipeline in seconds without another LLM call. This is the auditability property that separates quant VC from black-box machine learning applied to venture.
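The audit trail described above can be rendered mechanically from the fired rules; the rule names and lift figures below are the ones quoted in this section, while the assembly code itself is an illustrative sketch:

```python
# Sketch of assembling a human-readable explanation from the boolean
# rules that fired for one founder, each tagged with its measured lift.
fired_rules = [
    ("repeat_exit_founder", 4.37),
    ("elite_qs_top10", 2.18),
    ("founder_scaled_company_200plus", 2.80),
]

explanation = "this founder matches " + ", ".join(
    f"{name} ({lift:.2f}x lift)" for name, lift in fired_rules
)
```

Because the explanation is built from named, independently validated predicates, disagreeing with any single rule means editing a few lines of Python, not retraining a model.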
The logistic regression coefficients on top of the rule matrix give additional interpretability. The relative weight of each validated rule is a single learned number, not a hidden attention map.
How Verifiable Reasoning fits into Vela's quant VC research program
Verifiable Reasoning is one of four papers in Vela's LLM-Augmented ML research thread, which treats LLMs as components inside classical ML pipelines rather than as reasoners end-to-end:
- LLM-AR: neural-symbolic screening that converts LLM heuristics into ProbLog rules, reaching 59.5% precision at a 5.9x lift over random.
- GPT-HTree: hierarchical clustering combined with LLM-derived founder personas for interpretable segmentation.
- Rare-event prediction: LLM-powered feature engineering combined with an XGBoost, Random Forest, and Logistic Regression ensemble, reaching an 11.1x lift over random.
- Verifiable Reasoning (this paper): LLMs as deterministic code generators rather than per-sample evaluators.
The broader quant VC program at Vela also includes the Think-Reason-Learn family (GPTree, Random Rule Forest, Reasoned Rule Mining, Policy Induction), the VCBench benchmark, and the Multi-Agent Framework that became Vela's V agent.
Limitations
The paper is explicit about its limitations, and they are worth stating directly. Rules are limited to patterns expressible as Python predicates over structured fields, which means signals requiring natural-language understanding of job descriptions need hybrid approaches. Iteration 2 trades recall for precision as the LLM generates more specialized rules, and balancing this may require explicit coverage targets or multi-objective optimization. The evaluation is limited to VCBench, so generalization to other datasets or success definitions still needs validation. The cluster-guided refinement loop is a heuristic, analyst-in-the-loop procedure rather than a fully automated optimization.
Read the paper
From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code.
Anirudh Jaidev Mahesh, Ben Griffin, Fuat Alican, Joseph Ternasky, Zakari Salifu, Kelvin Amoaba, Yagiz Ihlamur, Aaron Ontoyin Yin, Aikins Laryea, Afriyie Samuel, Yigit Ihlamur.
arXiv preprint arXiv:2603.13287, February 2026.
Read on arXiv.
Verifiable Reasoning is part of Vela's quant VC research program, anchored by Think-Reason-Learn and benchmarked on VCBench. For adjacent work in the LLM-Augmented ML thread, see LLM-AR, GPT-HTree, and Rare-event prediction. For the multi-agent line, see Founder-GPT and SSFF.
Authored by members of the Vela team. See the full roster of contributors.
For research collaboration on quant VC, LLM-based decision systems, interpretable feature engineering, or founder screening, email engage@vela.partners.