VCBench: The AGI Benchmark for Quant VC

Paper
VCBench: Benchmarking LLMs in Venture Capital.
Authors
Rick Chen (University of Oxford), Joseph Ternasky, Afriyie Samuel Kwesi, Ben Griffin (University of Oxford), Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Xianling Mu (University of Oxford), Fuat Alican, Yigit Ihlamur (Vela Research).
Venue
arXiv preprint, September 2025.
Status
Public benchmark at vcbench.com. arXiv:2509.14448.
Research program
The evaluation infrastructure of Vela's quant VC research. Every Vela model is measured against VCBench as the decisive public test.

What VCBench contributes to quant VC

VCBench is the first AGI benchmark for venture capital. It is a standardized, anonymized, publicly hosted dataset of 9,000 founder profiles with ground-truth success labels, designed so that language models, human investors, and specialized quant VC models can all be measured on the same task under the same conditions. Before VCBench, quant VC research had no shared yardstick. Every paper defined its own dataset, its own success definition, and its own evaluation protocol. Comparison across papers was essentially impossible.

VCBench closes that gap. It specifies a fixed founder-success prediction task, holds the label schema constant ($500M+ IPO, acquisition, or total funding), scales the prevalence to a statistically stable 9 percent, and releases the data in two formats: structured JSON for custom ML models, and anonymized prose for direct LLM input. The benchmark paper evaluates nine state-of-the-art LLMs and establishes that several of them surpass human investor baselines on the task, including Y Combinator and tier-1 VC firms. The leaderboard is public and open to new submissions at vcbench.com.
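The structured JSON format might look like the following record. This is an illustrative sketch only: the field names, bucket labels, and values below are assumptions, not VCBench's actual schema.

```python
import json

# Hypothetical anonymized founder record. Field names and bucket labels are
# illustrative assumptions, not the benchmark's published schema.
profile = {
    "founder_id": "f_00421",
    "education": [
        {"degree_cluster": "MS Computer Science", "qs_rank_bucket": "top-50"}
    ],
    "jobs": [
        {"industry_cluster": "enterprise software", "duration_bucket": "3-5 years"}
    ],
    "num_companies_founded": 2,
    "label": 0,  # 1 = success ($500M+ outcome), 0 = unsuccessful
}

serialized = json.dumps(profile, indent=2)
assert json.loads(serialized) == profile  # round-trips cleanly
```

The anonymized-prose variant would render the same fields as a short narrative paragraph for direct LLM input.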

What is quant VC, and where does VCBench fit?

Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. It treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved with the same rigor that quantitative finance brings to credit risk or quantitative medicine brings to diagnostic screening. The discipline requires quantitative scoring against honest baselines, reproducible methodology, and interpretability that allows every prediction to be audited.

VCBench is the measurement infrastructure of quant VC. Every Vela model, from the multi-agent systems line (Founder-GPT, SSFF) to the Think-Reason-Learn family (GPTree, Random Rule Forest, Reasoned Rule Mining, Policy Induction), is evaluated on VCBench as a decisive public test. VCBench also defines the human baselines that every quant VC model must clear to justify deployment: the market index at 1.9 percent real-world precision, Y Combinator at approximately 1.7x the index, and tier-1 VC firms at approximately 2.9x the index.

How is VCBench constructed?

VCBench is built from LinkedIn and Crunchbase founder data through a four-stage pipeline: coverage improvement, format standardization and filtering, anonymization at the entry and dataset level, and iterative field selection driven by adversarial re-identification testing. Every stage is designed to serve quant VC evaluation specifically, where the dual goal is maximum predictive signal with minimum contamination from name-based LLM recall.

Scale and success definition. The benchmark contains 9,000 founder profiles, 810 of which are labeled successful (9 percent prevalence). Success means the founder's most recently founded company reached a $500M+ IPO, was acquired above $500M, or raised more than $500M in total funding. Unsuccessful means the company raised $100K to $4M but did not achieve an exit, IPO, or substantial follow-on funding within eight years of founding. Companies were mostly founded in the United States between 2010 and 2018. The 9 percent success prevalence is higher than the 1.9 percent real-world base rate, which is an intentional design choice to stabilize statistical testing at a manageable sample size.
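The labeling rule above can be sketched as a small function. The dollar thresholds and eight-year horizon come from the benchmark's definitions; the function signature and the exclusion of in-between cases are illustrative assumptions.

```python
# Sketch of VCBench's labeling rule. Thresholds ($500M success, $100K-$4M
# failure band, eight-year horizon) follow the definitions above; the
# record fields and the None-for-excluded convention are assumptions.
def label_founder(ipo_usd=0, acquisition_usd=0, total_funding_usd=0,
                  years_since_founding=0):
    """Return 'success', 'unsuccessful', or None (fits neither definition)."""
    if max(ipo_usd, acquisition_usd, total_funding_usd) >= 500_000_000:
        return "success"
    if (100_000 <= total_funding_usd <= 4_000_000
            and years_since_founding >= 8):
        return "unsuccessful"
    return None  # e.g. mid-sized raise, or outcome not yet observable

assert label_founder(ipo_usd=750_000_000) == "success"
assert label_founder(total_funding_usd=2_000_000,
                     years_since_founding=9) == "unsuccessful"
assert label_founder(total_funding_usd=50_000_000,
                     years_since_founding=9) is None
```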

Format standardization and filtering. LinkedIn contributes dense but irregular text. Crunchbase contributes structured but incomplete records. The cleaning pipeline reduced the industry vocabulary from 314 to 61 clusters (an 80.6 percent reduction), education degrees from 2,155 to 404 unique entries (81.3 percent), education records from 20,573 to 15,620 (24.1 percent), and job records from 45,975 to 41,183 (10.4 percent).
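The quoted reduction percentages follow directly from the before/after counts:

```python
# Recompute the vocabulary-reduction percentages quoted above from the
# before/after counts reported in the paper.
stages = {
    "industry clusters": (314, 61),
    "education degrees": (2155, 404),
    "education records": (20573, 15620),
    "job records":       (45975, 41183),
}
reductions = {name: round(100 * (1 - after / before), 1)
              for name, (before, after) in stages.items()}
print(reductions)
# {'industry clusters': 80.6, 'education degrees': 81.3,
#  'education records': 24.1, 'job records': 10.4}
```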

Anonymization. Founder names, company names, locations, and exact dates are removed. Industry labels are clustered into 61 groups, each containing at least 10 founders. Education prestige is preserved through QS university rankings (bucketed or unbucketed, depending on the leakage trade-off), and career durations are bucketed into year ranges. Adversarial re-identification testing was run with two classes of attacker: DeepSeek-R1 offline and Gemini-2.5-Pro with web search online. The final format reduces online re-identification from 77.0 percent (on the raw JSON) to 15.1 percent, and offline re-identification from 17.2 percent to 1.3 percent (a 92 percent reduction).
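Two of these transformations are easy to make concrete. The sketch below buckets career durations into year ranges and merges industry labels whose cluster would fall under the 10-founder minimum; the specific bucket edges and the "other" fallback label are assumptions, not the paper's exact choices.

```python
from collections import Counter

# Illustrative anonymization helpers. Bucket edges and the "other"
# fallback are assumptions; only the 10-founder minimum is from the paper.
def bucket_duration(months):
    years = months / 12
    if years < 1:
        return "<1 year"
    if years < 3:
        return "1-3 years"
    if years < 5:
        return "3-5 years"
    return "5+ years"

def enforce_min_cluster(labels, k=10):
    """Replace any industry label held by fewer than k founders."""
    counts = Counter(labels)
    return [lab if counts[lab] >= k else "other" for lab in labels]

assert bucket_duration(18) == "1-3 years"
assert enforce_min_cluster(["ai"] * 12 + ["niche"] * 3) == \
    ["ai"] * 12 + ["other"] * 3
```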

Iterative feature selection. Every anonymization change was proposed, tested against re-identification attacks, and accepted only if it reduced leakage while preserving predictive structure. Founders identified by the adversary on two or more occasions were removed entirely, further reducing residual contamination.

Private test fold. Only half of the 9,000 founders (4,500) are publicly released. The remaining half is held back as a private evaluation fold. Leaderboard submissions are scored on the private fold, so contamination in the pre-training corpus of future LLMs does not corrupt the benchmark.

Who's on the VCBench leaderboard?

The paper's primary evaluation covers nine vanilla LLMs against two human-expert baselines. Results are reported as precision, recall, and F0.5, which weights precision twice as heavily as recall to reflect the asymmetric false-positive cost of venture investing. Average performance across six folds of 1,500 founders each:

  • GPT-4o: 29.1 percent precision, 16.2 percent recall, F0.5 = 25.1 (highest F0.5)
  • DeepSeek-R1: 37.6 percent precision, 8.4 percent recall, F0.5 = 22.1
  • GPT-4o-mini: 29.5 percent precision, 10.1 percent recall, F0.5 = 21.2
  • o3: 42.4 percent precision, 7.0 percent recall, F0.5 = 20.9
  • Gemini-2.5-Pro: 17.2 percent precision, 59.0 percent recall, F0.5 = 20.1
  • Claude-3.5-Haiku: 16.9 percent precision, 48.6 percent recall, F0.5 = 19.4
  • GPT-5: 53.7 percent precision, 4.3 percent recall, F0.5 = 16.2
  • Gemini-2.5-Flash: 12.6 percent precision, 69.1 percent recall, F0.5 = 15.1
  • DeepSeek-V3: 59.1 percent precision, 3.0 percent recall, F0.5 = 11.8 (highest precision)

Human baselines on the same VCBench-scaled task: a random classifier matches the 9.0 percent prevalence, Y Combinator delivers roughly 1.7x the index, and tier-1 VC firms deliver roughly 2.9x. GPT-4o's 29.1 percent precision is approximately 3.2x the index and therefore exceeds both Y Combinator and tier-1 VC performance. The headline finding: vanilla LLMs already match or beat human experts on standardized quant VC screening, once you evaluate them on a clean benchmark that prevents identity leakage.
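The F0.5 scores and index multipliers above follow from the standard F-beta formula with beta = 0.5, which weights precision twice as heavily as recall. Checked here against GPT-4o's reported numbers:

```python
# F-beta with beta = 0.5 (the benchmark's metric) weights precision
# twice as heavily as recall.
def f_beta(precision, recall, beta=0.5):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# GPT-4o's reported averages: 29.1% precision, 16.2% recall.
p, r = 0.291, 0.162
print(round(100 * f_beta(p, r), 1))  # 25.1, matching the table
print(round(p / 0.09, 1))            # 3.2x the 9% random-classifier index
```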

Vela's specialized quant VC models extend this frontier further. The Think-Reason-Learn family is reported on VCBench as part of standard evaluation, and the Vela production stack reaches 19 to 38 percent precision when scaled back to the real-world 1.9 percent base rate, a 10x to 20x lift over the index. VCBench is what makes that kind of claim verifiable rather than rhetorical.
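One textbook way to translate benchmark precision to a different base rate, not necessarily the procedure used in the paper, is to hold the classifier's true-positive rate (recall) and false-positive rate fixed and recompute precision at the new prevalence:

```python
# Base-rate adjustment of precision: assume the classifier's TPR (recall)
# and FPR are unchanged between the benchmark (9% prevalence) and the
# real world (1.9%). This is a standard sketch, not the paper's exact method.
def rescale_precision(precision, recall, prev_bench, prev_real):
    # Recover the false-positive rate implied by benchmark precision/recall.
    tp = prev_bench * recall
    fp = tp * (1 - precision) / precision
    fpr = fp / (1 - prev_bench)
    # Reapply the same TPR/FPR at the real-world prevalence.
    tp_real = prev_real * recall
    fp_real = (1 - prev_real) * fpr
    return tp_real / (tp_real + fp_real)

# Illustration with GPT-4o's benchmark numbers, rescaled from 9% to 1.9%:
print(round(100 * rescale_precision(0.291, 0.162, 0.09, 0.019), 1))  # ~7.4
```

Under these assumptions, GPT-4o's benchmark precision corresponds to a single-digit real-world precision, which is why the paper's 1.9 percent index is the honest yardstick for deployment claims.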

How VCBench measures quant VC progress

Every claim about quant VC capability should reduce to a VCBench score. A method that cannot be evaluated on VCBench cannot be compared to prior work, cannot be compared to human baselines, and cannot be trusted in deployment. VCBench standardizes six things at once:

  • The task: predict, from a founder's LinkedIn and Crunchbase trail, whether their most recent company clears the $500M outcome threshold.
  • The data: 9,000 anonymized founder profiles with matched company outcome labels.
  • The success definition: $500M+ IPO, acquisition, or total funding.
  • The failure definition: $100K to $4M raised, no qualifying outcome within eight years.
  • The metric: F0.5, which weights precision twice as heavily as recall to reflect the asymmetric cost of false positives in venture.
  • The baselines: market index, Y Combinator, tier-1 VC firms, and the full vanilla-LLM frontier.

This shared infrastructure is what lets quant VC research compound. A new reasoning model, a new retrieval-augmented pipeline, or a new feature-engineering technique can be dropped into the evaluation loop and immediately scored against every prior result.
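That evaluation loop can be sketched in a few lines: score binary predictions fold by fold with F0.5, then average. The six-fold structure mirrors the paper; the toy folds below are made-up data for illustration.

```python
# Minimal sketch of a per-fold evaluation harness. Fold-by-fold scoring
# mirrors the paper's protocol; the toy predictions below are invented.
def precision_recall_f05(preds, labels):
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f05 = (1.25 * prec * rec / (0.25 * prec + rec)) if prec + rec else 0.0
    return prec, rec, f05

def evaluate(folds):
    """Average precision, recall, and F0.5 across folds."""
    scores = [precision_recall_f05(preds, labels) for preds, labels in folds]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Toy example: two tiny folds instead of six folds of 1,500 founders.
folds = [
    ([1, 0, 1, 0], [1, 0, 0, 0]),
    ([0, 1, 0, 0], [0, 1, 1, 0]),
]
print(evaluate(folds))  # averaged (precision, recall, F0.5)
```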

What makes VCBench auditable for quant VC research

VCBench's auditability story has three parts. First, the construction pipeline is fully described in the paper, including vocabulary-reduction counts at every stage, so researchers can trace exactly how the raw data became the benchmark. Second, every anonymization choice is justified by adversarial testing with reported identification rates across five candidate formats. Third, the per-fold scores across all six folds are reported for each of the nine evaluated LLMs, so submissions can be replicated fold-by-fold rather than only on the aggregate. Auditability at the benchmark level is as essential to quant VC as auditability at the model level.

How VCBench fits into Vela's quant VC research program

VCBench is the common evaluation infrastructure for everything else Vela publishes. Each of the four Think-Reason-Learn papers reports VCBench results. The VCBench paper explicitly cites Policy Induction (Mu et al., 2025) and Random Rule Forest (Griffin et al., 2025) as motivating work that established founder profiles alone carry strong predictive signal. SSFF, the multi-agent architecture that became V, uses VCBench-compatible evaluation protocols. Founder-GPT predates VCBench, but its descendants all report VCBench numbers.

Externally, VCBench is the quant VC equivalent of SWE-bench for software engineering or SDBench for medical diagnosis. It is designed as a living benchmark, updated as new founder cohorts mature and as new re-identification attacks appear, with an open leaderboard at vcbench.com.

Limitations

The paper is explicit about what VCBench does not resolve. The 9 percent prevalence is higher than the 1.9 percent real-world base rate, so precision multipliers reported on VCBench (GPT-4o's 3.2x, for example) may not hold exactly under the true real-world distribution. Human baselines are normalized for like-for-like comparison, but VC firms self-select their deal flow in ways a benchmark cannot replicate, so the machine-versus-human gap on VCBench may differ in practice from live deployment. The dataset covers US founders from 2010 to 2018 and inherits LinkedIn and Crunchbase coverage biases (stronger for tech startups and publicly visible founders). The eight-year success horizon introduces right-censoring, penalizing more recent cohorts whose outcomes have not yet had time to materialize. Residual noise remains after multistage cleaning.

Read the paper, use the benchmark

VCBench: Benchmarking LLMs in Venture Capital.
Rick Chen, Joseph Ternasky, Afriyie Samuel Kwesi, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Xianling Mu, Fuat Alican, Yigit Ihlamur.
arXiv preprint, September 2025.
arXiv:2509.14448.

The live benchmark and public leaderboard: vcbench.com.

VCBench is the evaluation infrastructure for Vela's quant VC research program. For the reasoning-model line evaluated on it, see Think-Reason-Learn, GPTree, Random Rule Forest, Reasoned Rule Mining, and Policy Induction. For the multi-agent systems line, see Founder-GPT and SSFF.

Authored by members of the Vela team. See the full roster of contributors.

For research collaboration in quant VC benchmarking, LLM evaluation for venture capital, or leaderboard submissions, email engage@vela.partners.
