Verifiable Reasoning: LLMs as Code Generators for Quant VC Founder Screening
What Verifiable Reasoning contributes to quant VC
Verifiable Reasoning reframes how large language models are used inside quant VC founder screening. Instead of calling an LLM once per founder to evaluate each rule, the paper prompts the LLM once to emit executable Python code, and then runs that code deterministically across the entire founder dataset. The result is a quant VC screening pipeline that is cheaper, faster, reproducible, and auditable, without giving up the interpretability that quant VC firms require to justify their decisions to LPs and partners.
On VCBench, Vela's standardized quant VC benchmark of 4,500 anonymized founders with a 9% base rate of success, Verifiable Reasoning achieves 37.5% precision and an F0.5 score of 25.0%, a 4.17x lift over random indexing. It beats GPT-4o on precision (30.0%, F0.5 of 25.7%) while ensuring every prediction traces back to human-readable founder attributes that a quant VC partner can inspect and edit.
What is quant VC, and where does Verifiable Reasoning fit?
Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. Quant VC treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved with the same rigor that quantitative finance brings to credit risk or that quantitative medicine brings to diagnostic screening. Quant VC requires quantitative scoring against honest baselines, reproducible methodology, and interpretability that allows every prediction to be audited.
Verifiable Reasoning sits inside Vela's LLM-Augmented ML research thread. That thread treats the LLM as a component inside a classical machine learning pipeline rather than as the reasoning substrate itself. Where Vela's Think-Reason-Learn family teaches LLMs to reason like investors through decision trees, rule forests, and policy induction, the LLM-Augmented ML family uses LLMs to enrich classical ML: as feature generators, as symbolic rule writers, as persona clusterers, and in this paper, as code generators. This architectural choice matters for quant VC because it decouples the scaling cost of screening from the pricing of frontier LLM APIs.
How does Verifiable Reasoning turn LLMs into quant VC code generators?
Rather than asking the LLM “did this founder scale a company past 200 employees” 4,500 times, Verifiable Reasoning asks the LLM once to write the Python predicate that checks this condition, and then runs the predicate against all 4,500 structured founder profiles. The pipeline has four stages.
Rule generation with stratified sampling. Each iteration sends a single prompt containing the founder data schema, helper functions for parsing QS rankings, tenure strings, and company size ranges, and 120 stratified founder profiles (60 successful and 60 unsuccessful). This balanced sampling corrects for the 9% base rate so the LLM does not generate rules that trivially predict failure. The LLM emits 30 candidate rules as Python lambda expressions, each with a name, a description, and an executable body.
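A minimal sketch of what one LLM-emitted candidate rule might look like, assuming a dict-based founder profile; the field names (qs_rank, num_exits) and the profile values are illustrative, not taken from the paper's actual schema:

```python
# Illustrative shape of one LLM-emitted candidate rule: a name, a
# human-readable description, and an executable Python predicate over
# a structured founder profile. Field names here are hypothetical.
candidate_rule = {
    "name": "elite_school_and_exit",
    "description": "Attended a top-50 QS-ranked university and has >= 1 prior exit",
    "predicate": lambda f: (
        f.get("qs_rank") is not None
        and f["qs_rank"] <= 50
        and f.get("num_exits", 0) >= 1
    ),
}

# A structured founder profile the predicate can run against.
founder = {"qs_rank": 12, "num_exits": 1, "max_company_size": 350}
fired = candidate_rule["predicate"](founder)  # True for this profile
```

Because the rule body is plain Python, a partner can read the predicate directly and see exactly what condition it encodes.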
Deterministic evaluation. Every rule is compiled to a callable and executed across the full founder dataset with no further LLM involvement. A rule like elite_school_and_exit, which checks whether a founder attended a top-50 QS-ranked university and has at least one prior exit, evaluates identically across runs. This is where the 99% cost reduction over per-sample LLM rule evaluation comes from, and where stochastic hallucination risk is eliminated at inference time.
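The deterministic evaluation stage can be sketched as a single pass that builds a boolean activation matrix, one row per founder and one column per rule; the rules and profiles below are toy examples, not the paper's actual rule set:

```python
# Sketch of deterministic rule evaluation: compiled rules run over the
# full dataset with no LLM in the loop, so the activation matrix is
# identical across runs. Rule and field names are illustrative.
rules = {
    "has_prior_exit": lambda f: f.get("num_exits", 0) >= 1,
    "scaled_past_200": lambda f: f.get("max_company_size", 0) > 200,
}

founders = [
    {"num_exits": 1, "max_company_size": 500},
    {"num_exits": 0, "max_company_size": 50},
    {"num_exits": 2, "max_company_size": 30},
]

# Activation matrix: one boolean row per founder, one column per rule.
activations = [
    {name: bool(rule(f)) for name, rule in rules.items()}
    for f in founders
]
```

Evaluating 30 rules over 4,500 profiles this way is a cheap in-memory loop, which is where the cost reduction over per-sample LLM querying comes from.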
Statistical validation. Each rule is filtered by three thresholds that matter for quant VC screening: precision lift greater than 1.0 (the rule identifies above-baseline founders), one-sided binomial p-value below 0.05 (the lift is statistically significant), and coverage between 1% and 50% (the rule has enough power to test but is not so broad that it captures generic attributes). In Iteration 1, 19 of 30 rules pass validation. In Iteration 2, 21 of 30 pass with a higher mean precision lift of 2.58x versus 1.70x.
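The three validation filters can be expressed compactly; this is a sketch, with the one-sided binomial tail computed directly from the base rate (405 successes out of 4,500 matches the 9% prevalence stated above, and the example rule's hit counts are hypothetical):

```python
from math import comb

def validate_rule(hits, hit_successes, n_total, n_successes):
    """Apply the three quant VC filters to one candidate rule (sketch).

    hits: founders where the rule fires; hit_successes: how many of
    those succeeded; n_total / n_successes: dataset totals.
    """
    base_rate = n_successes / n_total
    precision = hit_successes / hits
    lift = precision / base_rate
    coverage = hits / n_total
    # One-sided binomial tail: P(X >= hit_successes) under the base rate.
    p_value = sum(
        comb(hits, k) * base_rate**k * (1 - base_rate) ** (hits - k)
        for k in range(hit_successes, hits + 1)
    )
    passes = lift > 1.0 and p_value < 0.05 and 0.01 <= coverage <= 0.50
    return lift, p_value, coverage, passes

# Hypothetical rule firing on 200 of 4,500 founders, 40 of them successful.
lift, p, cov, ok = validate_rule(
    hits=200, hit_successes=40, n_total=4500, n_successes=405
)
```

In this example the rule shows a 2.22x lift at roughly 4.4% coverage with a vanishingly small p-value, so it would pass all three filters.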
Cluster-based gap analysis. After validation, founders are clustered by their rule activation patterns using K-means. Mixed clusters (7% to 12% success rate with more than 400 founders) and low-success clusters with rare winners are flagged. Gap analysis reveals that 68% of founders fall into mixed clusters where Iteration 1 rules fail to discriminate, and 57% of successful founders are not covered by the top rules. This structured feedback, rather than human annotation, guides the next iteration of rule generation.
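The gap-analysis step can be illustrated with a simplified stand-in for K-means: grouping founders by their exact rule-activation pattern and flagging groups whose success rate sits in the mixed band. The 7% to 12% band comes from the text; the toy data and the exact-pattern grouping (rather than true K-means centroids) are simplifications:

```python
from collections import defaultdict

def flag_mixed_clusters(activation_rows, outcomes, low=0.07, high=0.12):
    """Flag activation-pattern groups whose success rate is near the base
    rate, i.e. where the current rules fail to discriminate (sketch)."""
    clusters = defaultdict(list)
    for row, outcome in zip(activation_rows, outcomes):
        clusters[tuple(row)].append(outcome)
    mixed = []
    for pattern, members in clusters.items():
        rate = sum(members) / len(members)
        if low <= rate <= high:
            mixed.append((pattern, rate, len(members)))
    return mixed

# Toy data: 2 rules, 20 founders (1 = success).
rows = [(1, 0)] * 10 + [(0, 0)] * 10
outs = [1] + [0] * 9 + [0] * 10   # first group: 10% success rate
mixed = flag_mixed_clusters(rows, outs)
```

The flagged patterns (plus rare winners in low-success groups) become the structured feedback fed into the next round of rule generation.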
How accurate is Verifiable Reasoning?
The evaluation runs on VCBench, the world's first AGI benchmark for venture capital, which provides 4,500 anonymized founder profiles with a 9% success prevalence. Success is defined as an IPO, an acquisition, or a funding round above $500M. The downstream classifier is a logistic regression trained on validated rule activations with balanced class weights and 5-fold stratified cross-validation, threshold-tuned for F0.5.
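The paper's classifier is a logistic regression with balanced class weights under 5-fold stratified cross-validation; the sketch below shows only the final threshold-tuning step, selecting the decision cutoff that maximizes F0.5 on held-out scores (the score and label values are toy data):

```python
def fbeta(precision, recall, beta=0.5):
    """F-beta score; beta=0.5 weights precision twice as heavily as recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def tune_threshold(scores, labels, beta=0.5):
    """Pick the classifier decision threshold that maximizes F-beta (sketch)."""
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = fbeta(precision, recall, beta)
        if f > best_f:
            best_f, best_t = f, t
    return best_t, best_f

# Toy held-out scores: tuning favors a high-precision cutoff.
best_t, best_f = tune_threshold([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0])
```

Tuning for F0.5 rather than F1 is the right objective for rare-event screening, where a false positive costs partner diligence time.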
Headline numbers from the paper:
- Verifiable Reasoning Iteration 2 achieves 37.5% precision and an F0.5 score of 25.0%, a 4.17x lift over the 9% VCBench baseline.
- Iteration 1 already reaches 30.6% precision and an F0.5 score of 27.7%, matching GPT-4o's precision with higher F0.5.
- Iteration 2 discovers high-lift rules that Iteration 1 misses, including cybersecurity_domain_depth at 6.54x lift, ipo_only_exit at 4.76x, and repeat_exit_founder at 4.37x.
- The pipeline requires one LLM call per iteration for rule generation, versus the O(rules x founders) calls required by per-sample LLM rule evaluation.
Baseline comparisons on the same VCBench dataset:
- Random indexing: 9.0% precision.
- Y Combinator: 14.0% precision, F0.5 of 8.6%.
- Tier-1 VCs: 23.0% precision, F0.5 of 10.7%.
- Claude-3.5-Haiku: 15.8% precision, F0.5 of 18.2%.
- GPT-4o: 30.0% precision, F0.5 of 25.7%.
- o3: 43.2% precision, F0.5 of 21.5%.
- Verifiable Reasoning (Iteration 2): 37.5% precision, F0.5 of 25.0%.
At Vela's real-world founder-screening prevalence of 1.9%, the Think-Reason-Learn and LLM-Augmented ML families in production reach 19% to 38% precision, a 10x to 20x lift over the 1.9% US unicorn base rate. Verifiable Reasoning contributes an auditable, code-generated layer to that stack.
Why code generation matters for quant VC
A quant VC firm that screens tens of thousands of founders a year cannot afford per-sample LLM evaluation. The cost scales linearly with dataset size, the answers drift across runs because of sampling temperature, and there is no way to inspect why a given query returned a given verdict. Verifiable Reasoning solves all three at once: one API call per iteration, deterministic evaluation at scale, and every rule visible as a few lines of Python that a non-technical partner can read and edit.
This is the quant VC version of the same architectural argument that makes compiled code preferable to interpreted natural language in any production system. The LLM is the most expressive part of the pipeline, so it gets used where expressivity matters (writing the rule), not where determinism matters (evaluating the rule on 4,500 founders).
What makes Verifiable Reasoning auditable for quant VC decisions
Every prediction from Verifiable Reasoning decomposes into a list of boolean rules that fired for a given founder, each with its own precision lift, coverage, and p-value. A quant VC partner can ask “why did this founder score high” and get back a sentence like “this founder matches repeat_exit_founder (4.37x lift), elite_qs_top10 (2.18x lift), and founder_scaled_company_200plus (2.80x lift)”. The partner can disagree with a rule, rewrite its Python in a few lines, and rerun the pipeline in seconds without another LLM call. This is the auditability property that separates quant VC from black-box machine learning applied to venture.
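The audit trail described above can be rendered mechanically from the fired rules; the rule names and lift figures below are the ones quoted in this section, while the assembly code itself is an illustrative sketch:

```python
# Sketch of assembling a human-readable explanation from the boolean
# rules that fired for one founder, each tagged with its measured lift.
fired_rules = [
    ("repeat_exit_founder", 4.37),
    ("elite_qs_top10", 2.18),
    ("founder_scaled_company_200plus", 2.80),
]

explanation = "this founder matches " + ", ".join(
    f"{name} ({lift:.2f}x lift)" for name, lift in fired_rules
)
```

Because the explanation is built from named, independently validated predicates, disagreeing with any single rule means editing a few lines of Python, not retraining a model.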
The logistic regression coefficients on top of the rule matrix give additional interpretability. The relative weight of each validated rule is a single learned number, not a hidden attention map.
How Verifiable Reasoning fits into Vela's quant VC research program
Verifiable Reasoning is one of four papers in Vela's LLM-Augmented ML research thread, which treats LLMs as components inside classical ML pipelines rather than as reasoners end-to-end:
- LLM-AR: neural-symbolic screening that converts LLM heuristics into ProbLog rules, reaching 59.5% precision at a 5.9x lift over random.
- GPT-HTree: hierarchical clustering combined with LLM-derived founder personas for interpretable segmentation.
- Rare-event prediction: LLM-powered feature engineering combined with an XGBoost, Random Forest, and Logistic Regression ensemble, reaching an 11.1x lift over random.
- Verifiable Reasoning (this paper): LLMs as deterministic code generators rather than per-sample evaluators.
The broader quant VC program at Vela also includes the Think-Reason-Learn family (GPTree, Random Rule Forest, Reasoned Rule Mining, Policy Induction), the VCBench benchmark, and the Multi-Agent Framework that became Vela's V agent.
Limitations
The paper is explicit about its limitations, and they are worth stating directly. Rules are limited to patterns expressible as Python predicates over structured fields, which means signals requiring natural-language understanding of job descriptions need hybrid approaches. Iteration 2 trades recall for precision as the LLM generates more specialized rules, and balancing this may require explicit coverage targets or multi-objective optimization. The evaluation is limited to VCBench, so generalization to other datasets or success definitions still needs validation. The cluster-guided refinement loop is a heuristic, analyst-in-the-loop procedure rather than a fully automated optimization.
Read the paper
From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code.
Anirudh Jaidev Mahesh, Ben Griffin, Fuat Alican, Joseph Ternasky, Zakari Salifu, Kelvin Amoaba, Yagiz Ihlamur, Aaron Ontoyin Yin, Aikins Laryea, Afriyie Samuel, Yigit Ihlamur.
arXiv preprint arXiv:2603.13287, February 2026.
Read on arXiv.
Verifiable Reasoning is part of Vela's quant VC research program, anchored by Think-Reason-Learn and benchmarked on VCBench. For adjacent work in the LLM-Augmented ML thread, see LLM-AR, GPT-HTree, and Rare-event prediction. For the multi-agent line, see Founder-GPT and SSFF.
Authored by members of the Vela team. See the full roster of contributors.
For research collaboration on quant VC, LLM-based decision systems, interpretable feature engineering, or founder screening, email engage@vela.partners.