LLM-AR: Neural-Symbolic Reasoning for Quant VC Founder Evaluation

Paper
LLM-AR: LLM-powered Automated Reasoning Framework.
Authors
Rick Chen (University of Oxford), Joseph Ternasky, Aaron Ontoyin Yin, Xianling Mu (University of Oxford), Fuat Alican, Yigit Ihlamur (Vela Research).
Venue
arXiv preprint, October 2025.
Status
Preprint. arXiv:2510.22034.
Research program
Part of the LLM-Augmented ML line of Vela's quant VC research, which embeds LLMs as components inside classical ML pipelines rather than as standalone reasoners.

What LLM-AR contributes to quant VC

LLM-AR introduces a neural-symbolic architecture for quant VC founder evaluation. The LLM proposes prediction rules in natural language, those rules are translated into ProbLog (a probabilistic extension of Prolog), and ProbLog executes them as a deterministic automated reasoning engine to produce the final success or failure prediction. The LLM contributes heuristic pattern-finding. ProbLog contributes transparent and reproducible inference.

On a 4-fold cross-validated dataset of 6,000 US founders with a 10 percent success prevalence, LLM-AR reaches 59.5 percent precision at 8.7 percent recall, a 5.9x lift over the random baseline. It outperforms every vanilla LLM tested on the same data (GPT-4o mini at 49.5 percent, GPT-4o at 32.3 percent, DeepSeek-V3 at 31.0 percent, o3-mini at 21.6 percent) and outperforms the tier-1 seed-fund baseline of 29.5 percent. Ablations confirm both halves of the architecture are load-bearing: removing iterative training drops precision to 36.1 percent, and replacing ProbLog with a vanilla LLM (GPT-4o mini) at inference drops precision to 46.2 percent.

What is quant VC, and where does LLM-AR fit?

Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. It treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved with the same rigor that quantitative finance brings to credit risk or quantitative medicine brings to diagnostic screening. It requires quantitative scoring against honest baselines, reproducible methodology, and interpretability that allows every prediction to be audited.

LLM-AR sits in the LLM-Augmented ML strand of Vela's quant VC research. This strand wraps LLMs inside classical ML pipelines (neural-symbolic architectures, ensemble classifiers, feature engineering), treating the LLM as a specialized component rather than the final decision-maker. This is distinct from both the Multi-Agent Systems line (Founder-GPT, SSFF), which orchestrates the LLM as multiple reasoning agents, and the Think-Reason-Learn family, which uses the LLM to induce auditable symbolic rules evaluated directly. LLM-AR splits the difference: the LLM induces rules, but classical symbolic AI executes them.

How does LLM-AR predict founder success?

LLM-AR has four conceptual components: policy generation, statistical calibration, iterative refinement, and ProbLog execution. A policy is a set of prediction rules of the form “IF condition_1 AND condition_2 THEN success/failure”, each with an associated probability.
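A policy of this shape can be sketched as a small data structure. The field names, example rules, and probabilities below are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict    # feature name -> required value, ANDed together
    outcome: str        # "success" or "failure"
    probability: float  # LLM-assigned, later recalibrated against data

def rule_fires(rule: Rule, founder: dict) -> bool:
    # A rule applies when every precondition holds for the founder.
    return all(founder.get(k) == v for k, v in rule.conditions.items())

# Illustrative policy: these rules and probabilities are invented.
policy = [
    Rule({"prior_exit": True, "technical_degree": True}, "success", 0.45),
    Rule({"num_prior_startups": 0, "solo_founder": True}, "failure", 0.70),
]

founder = {"prior_exit": True, "technical_degree": True, "solo_founder": False}
fired = [r for r in policy if rule_fires(r, founder)]
```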

Policy generation. For each founder in a training batch, the LLM (DeepSeek-V3 in the paper's implementation) writes a short natural-language explanation of why the founder succeeded or failed. The batch-level insights are then summarized into a list of logical prediction rules, each assigned a probability by the LLM.

Statistical calibration. Every generated rule is audited against data. Association-rule mining identifies feature combinations that are statistically associated with success or failure. Rule probabilities are then recalibrated by sampling 1,000 training founders and computing the empirical success rate conditional on the rule preconditions. Rules with too few matching founders are flagged “not enough samples” and reviewed for removal.
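A minimal sketch of that recalibration step, on synthetic data (the paper samples 1,000 real training founders; the feature name and the support cutoff below are assumptions):

```python
import random

random.seed(0)
# Synthetic sample standing in for 1,000 real training founders.
founders = [
    {"prior_exit": random.random() < 0.2, "success": random.random() < 0.1}
    for _ in range(1000)
]

def calibrate(precondition, sample, min_support=30):
    # Empirical success rate among founders matching the rule preconditions.
    matches = [f for f in sample if precondition(f)]
    if len(matches) < min_support:
        return None  # flagged "not enough samples", reviewed for removal
    return sum(f["success"] for f in matches) / len(matches)

p = calibrate(lambda f: f["prior_exit"], founders)
```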

Iterative refinement. The LLM is shown the statistical analysis of its own policy and asked to reflect: refine probabilities, remove rules with weak predictive power, and incorporate newly surfaced rules from association-rule mining. This reflection loop runs for multiple iterations. Training-time policy evaluation every five iterations guards against regression, using an LLM evaluator to analyze how rule changes affected the F0.25 score over time.
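The shape of that loop can be sketched as follows. Here llm_reflect, evaluate_policy, and the statistics function are stubs standing in for the LLM reflection call, the F0.25 evaluation, and the rule audit; the paper does not publish this as code, so everything below is an assumption about structure only:

```python
def llm_reflect(policy, stats):
    # Stub: the real system asks the LLM to refine probabilities,
    # drop weak rules, and absorb newly mined rules.
    return policy

def evaluate_policy(policy):
    # Stub: the real system scores the policy by F0.25 over training data.
    return 0.5

def train(policy, stats_fn, iterations=20, eval_every=5):
    best_policy, best_score, history = policy, float("-inf"), []
    for i in range(1, iterations + 1):
        stats = stats_fn(policy)             # audit current rules against data
        policy = llm_reflect(policy, stats)  # reflection step
        if i % eval_every == 0:              # periodic guard against regression
            score = evaluate_policy(policy)
            history.append(score)
            if score > best_score:
                best_policy, best_score = policy, score
    return best_policy, history

final_policy, scores = train(["rule placeholder"], lambda p: {})
```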

ProbLog execution. At inference, the final policy is translated into ProbLog syntax, founder traits are converted to ProbLog facts with probabilities reflecting strength of evidence, and two queries are executed: one for success probability, one for failure probability. The founder is classified as successful only if the success probability clears a tuned threshold AND the failure probability falls below a second threshold. Both thresholds are optimized on a held-out validation set against F0.25, which weights precision four times more heavily than recall to match the asymmetric cost of false positives in quant VC.
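The dual-threshold decision rule and the F0.25 objective can be sketched as below; the threshold values are illustrative placeholders, not the paper's tuned values:

```python
T_SUCCESS = 0.6  # illustrative; the paper tunes both on a validation set
T_FAILURE = 0.3

def classify(p_success: float, p_failure: float) -> bool:
    # Predict success only if the success query clears one threshold
    # AND the failure query stays below the other.
    return p_success >= T_SUCCESS and p_failure <= T_FAILURE

def f_beta(precision: float, recall: float, beta: float = 0.25) -> float:
    # F_beta = (1 + b^2) * P * R / (b^2 * P + R); beta < 1 favors precision.
    b2 = beta * beta
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 0.25, a precision-heavy operating point scores higher than the mirror-image recall-heavy one, which is exactly the asymmetry the text describes.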

How accurate is LLM-AR?

LLM-AR was evaluated through 4-fold cross-validation on a 6,000-founder dataset with 10 percent success prevalence (600 successful, 5,400 unsuccessful). Success means the founder's company reached a $500M+ IPO, $500M+ acquisition, or raised more than $500M. Unsuccessful means the company raised between $100K and $4M.

Performance on test folds, against baselines on the same 10 percent prevalence data:

  • LLM-AR: 59.5 percent precision, 8.7 percent recall, F0.25 = 41.6 percent.
  • GPT-4o mini: 49.5 percent precision, 8.0 percent recall, F0.25 = 37.8 percent.
  • GPT-4o: 32.3 percent precision, 18.7 percent recall, F0.25 = 30.9 percent.
  • DeepSeek-V3: 31.0 percent precision, 15.2 percent recall, F0.25 = 29.1 percent.
  • Tier-1 seed funds (scaled from real-world performance): 29.5 percent precision.
  • o3-mini: 21.6 percent precision, 31.3 percent recall, F0.25 = 22.0 percent.
  • Random classifier: 10.0 percent precision.

The 59.5 percent precision represents a 5.9x lift over the random baseline. Validation-set performance is higher (66.2 percent precision, 9.7 percent recall, F0.25 = 47.7 percent), confirming that most of the signal generalizes to unseen folds but that threshold optimization is partition-dependent.

Ablations confirm each component contributes. Without iterative training, precision drops to 36.1 percent. With GPT-4o mini substituting for ProbLog at inference, precision drops to 46.2 percent. Both architectural choices are doing real work.

LLM-AR is also tunable without retraining. By optimizing the threshold pair against different F-beta values, the same policy produces a spectrum of precision-recall trade-offs: at beta = 0.125, the model reaches 100 percent precision at 2 percent recall; at beta = 4, it reaches 92 percent recall at 12.5 percent precision. Venture investors can dial this to match their own portfolio construction.
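That retraining-free tuning amounts to re-running the threshold search under a different beta. A sketch on synthetic validation scores (the grid resolution, score distribution, and labels below are all invented):

```python
import itertools
import random

random.seed(1)
# Synthetic validation set: (p_success, p_failure, is_success) triples.
val = [(random.random(), random.random(), random.random() < 0.1)
       for _ in range(500)]

def f_beta(p, r, beta):
    b2 = beta * beta
    return 0.0 if p == 0.0 or r == 0.0 else (1 + b2) * p * r / (b2 * p + r)

def tune(beta, grid=tuple(i / 10 for i in range(11))):
    # Sweep both thresholds; keep the pair maximizing F-beta.
    best_score, best_pair = -1.0, None
    for t_s, t_f in itertools.product(grid, grid):
        preds = [ps >= t_s and pf <= t_f for ps, pf, _ in val]
        tp = sum(pred and y for pred, (_, _, y) in zip(preds, val))
        fp = sum(pred and not y for pred, (_, _, y) in zip(preds, val))
        fn = sum(not pred and y for pred, (_, _, y) in zip(preds, val))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(p, r, beta)
        if score > best_score:
            best_score, best_pair = score, (t_s, t_f)
    return best_pair

precision_mode = tune(0.125)  # precision-heavy operating point
recall_mode = tune(4.0)       # recall-heavy operating point
```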

Vela's full production quant VC stack, across the Think-Reason-Learn family and related research, reaches 19 to 38 percent precision when scaled back to the 1.9 percent real-world base rate, a 10x to 20x lift over the index. LLM-AR contributes the neural-symbolic architecture to that broader program.

Why neural-symbolic reasoning matters for quant VC

Pure LLMs are unreliable decision-makers at high-stakes precision. The LLM-AR experiments show why: GPT-4o, DeepSeek-V3, and o3-mini all produced either inconsistent precision or unacceptable recall on the same task, and even the best vanilla LLM (GPT-4o mini at 49.5 percent precision) lagged LLM-AR by 10 percentage points. The bottleneck is not reasoning capability. It is reasoning reproducibility. The same prompt can yield different answers, and the decision path is not inspectable.

Neural-symbolic architectures address both problems. The LLM does what it is good at (surfacing heuristic patterns from unstructured founder data) and a symbolic engine does what it is good at (executing those patterns deterministically and transparently). For quant VC, where a partner has to be able to inspect the reasoning chain and override a recommendation, neural-symbolic is the natural integration point.

What makes LLM-AR auditable for quant VC decisions

Every LLM-AR decision decomposes into three inspectable artifacts: a human-readable policy (the set of prediction rules), per-rule probabilities that were calibrated against real data, and a ProbLog execution trace showing which rules fired with what probabilities. A partner can open the policy, edit a rule by hand, redeploy the model, and get a new prediction immediately. The paper explicitly calls out “expert-in-the-loop” as a design goal: human experts can modify the policy with domain knowledge between iterations. This is the auditability standard Vela applies across its quant VC stack.

How LLM-AR fits into Vela's quant VC research program

LLM-AR sits in the LLM-Augmented ML family at Vela, alongside related papers that embed LLMs into classical ML pipelines:

  • Same family: LLM-AR, GPT-HTree (hierarchical clustering plus LLM personas), rare-event prediction with LLM feature engineering, and verifiable reasoning (LLMs as code generators) all share the thesis that the LLM is most useful as a component inside a tested ML pipeline.
  • Adjacent via method: LLM-AR's rule-induction step is conceptually close to the Think-Reason-Learn family (GPTree, Random Rule Forest, Reasoned Rule Mining, Policy Induction), which also uses LLMs to induce interpretable decision rules. The key difference: TRL models evaluate rules directly; LLM-AR hands rules off to ProbLog for probabilistic symbolic execution.
  • Adjacent via pattern: The multi-agent Founder-GPT and SSFF papers orchestrate LLMs as reasoning agents. LLM-AR instead pairs a single LLM with a symbolic engine.
  • Benchmarking: Related methods developed at Vela are evaluated on VCBench, the public benchmark for quant VC.

Limitations

The paper is explicit about what LLM-AR does not yet achieve. The 10 percent success prevalence of the evaluation dataset is inflated from the 1.9 percent real-world base rate, so absolute precision numbers should be interpreted cautiously when extrapolating to deployment. Policy size and evaluation speed are constrained by the Python ProbLog implementation, which is slower than ProbLog's own reference engine. The training loop does not guarantee monotone improvement across iterations, which the authors mitigate by testing multiple candidate policies rather than only the final one. The LLM can only reason over the 52 engineered features in the fixed dataset, because the statistical calibration step requires numeric representations. All features are existing founder attributes rather than newly discovered ones.

Read the paper

LLM-AR: LLM-powered Automated Reasoning Framework.
Rick Chen, Joseph Ternasky, Aaron Ontoyin Yin, Xianling Mu, Fuat Alican, Yigit Ihlamur.
arXiv preprint, October 2025.
arXiv:2510.22034.

LLM-AR is part of the LLM-Augmented ML family of Vela's quant VC research. For related work within the same family, see GPT-HTree, rare-event prediction, and verifiable reasoning. For the adjacent rule-induction line, see the Think-Reason-Learn family, including GPTree, Random Rule Forest, Reasoned Rule Mining, and Policy Induction. For the multi-agent line, see Founder-GPT and SSFF. For the benchmark that measures progress across all three families, see VCBench.

Authored by members of the Vela team. See the full roster of contributors.

For research collaboration in quant VC, neural-symbolic AI, and automated reasoning for founder evaluation, email engage@vela.partners.
