Policy Induction: Editable Natural-Language Policies for Quant VC

Paper
Policy Induction: Predicting Startup Success via Explainable, Memory-Augmented In-Context Learning.
Authors
Xianling Mu (University of Oxford), Joe Ternasky (Vela Research), Fuat Alican (Vela Research), Rick Chen (University of Oxford), Yigit Ihlamur (Vela Research).
Venue
IEEE CSCloud’25, New York. IEEE Xplore.
Status
Published. Provisional patent in progress.
Research program
Part of Think-Reason-Learn at Vela, the quant VC research program.

What Policy Induction contributes to quant VC

Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. Policy Induction is the architecture in Vela's quant VC program that answers a specific question: can the full decision logic of an LLM-based classifier live inside a natural-language policy that a non-technical expert can read, edit, and rewrite, without fine-tuning, without gradient descent, and without code?

The answer is yes. On VCBench, Policy Induction with GPT-5 reaches 67% precision on founder success prediction, a 7.4x lift over the benchmark's 9% base rate and 3x the precision of tier-1 VC firms on the same benchmark. A different configuration with Gemini-2.5-Flash achieves F0.5 of 34%, the highest across all baselines and a 35% relative improvement over the best vanilla LLM. The decision logic is a prompt. A partner at a quant VC firm can open the policy, cross out a rule they disagree with, rewrite it, and deploy the new version in minutes.

What is quant VC, and where does Policy Induction fit?

Quant VC treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved with the same rigor that quantitative finance brings to credit risk or quantitative medicine brings to diagnostic screening. Quant VC requires quantitative scoring against honest baselines, reproducible methodology, and interpretability that allows every prediction to be audited.

Policy Induction sits at the deployment layer of Vela's quant VC architecture. The other three core papers in the Think-Reason-Learn program build different kinds of machinery: GPTree builds decision trees, Random Rule Forest builds voting ensembles, Reasoned Rule Mining builds calibrated two-stage cascades. Policy Induction throws most of the machinery out. The policy is the model. Everything else is in-context learning. This makes Policy Induction the most operationally portable architecture in the family, the easiest for a quant VC firm to adopt, and the easiest for a non-technical partner to own.

How does Policy Induction predict founder success?

Policy Induction comprises three components.

1. Initial policy induction. The LLM is shown a small set of labeled examples (in the paper, 20 positive and 20 negative founders) alongside optional expert edits, and asked to synthesize a structured rule list. The resulting policy is a plain-language document, typically organized into categories such as “Industry and Market Scalability,” “Relevant and Sustained High-Growth Experience,” and “Strategic Network and Ecosystem Leverage.” The policy can be constrained to a specific format (Markdown, HTML, XML) or length. Every rule is a sentence a human can read, and the full policy fits in a prompt.
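
A minimal sketch of how such an induction prompt might be assembled (the function and field names are illustrative assumptions, not from the paper):

```python
def build_induction_prompt(positives, negatives, expert_edits=None, fmt="Markdown"):
    """Assemble a few-shot prompt asking an LLM to induce a categorized rule-list policy."""
    lines = [f"Induce a structured {fmt} policy of categorized rules for "
             "predicting founder success from the labeled examples below."]
    lines += [f"SUCCESS: {p}" for p in positives]   # 20 positive founders in the paper
    lines += [f"FAILURE: {n}" for n in negatives]   # 20 negative founders in the paper
    if expert_edits:
        lines.append("Respect these expert edits:")
        lines += [f"- {edit}" for edit in expert_edits]
    return "\n".join(lines)
```

The returned string is sent to the LLM once; the model's reply is the initial plain-language policy document.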

2. Memory-augmented in-context learning loop. Given a current policy, a new case description, and its observed outcome, the LLM generates an updated policy. The higher-scoring version (original or updated) is retained. The paper describes two update strategies:

  • Sequential update. Iterate through examples one at a time. Greedily keep the best policy at each step.
  • Parallel update with selection. For each example, generate a candidate update. Score all candidates, select the top 10%, and synthesize a new policy conditioned on the highest-quality candidates.

The two strategies can be alternated to create a refinement loop. This is memory augmentation in the ICL sense: new cases refine the policy in context, with no parameter updates to the underlying LLM.
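
Treating the LLM's update and scoring calls as opaque callables, the two strategies reduce to a few lines. This is an illustrative sketch, with `update`, `score`, and `synthesize` standing in for LLM calls the paper does not name this way:

```python
def sequential_update(policy, examples, update, score):
    """Greedy sequential refinement: after each labeled case, keep the
    higher-scoring of the current policy and its updated candidate."""
    best, best_score = policy, score(policy)
    for case, outcome in examples:
        candidate = update(best, case, outcome)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

def parallel_update(policy, examples, update, score, synthesize, top_frac=0.10):
    """Parallel update with selection: one candidate per example, keep the
    top fraction by score, then synthesize a new policy from the survivors."""
    candidates = [update(policy, case, outcome) for case, outcome in examples]
    candidates.sort(key=score, reverse=True)
    k = max(1, int(len(candidates) * top_frac))
    return synthesize(candidates[:k])
```

Alternating the two functions over the case memory gives the refinement loop; no LLM parameters change between iterations.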

3. Multi-policy weighting. After a pool of policies is accumulated (for example, one policy per industry bucket), the system learns non-negative weights that combine them. For each policy and each example, the LLM's probability for the label “true” is queried, forming a probability matrix. The final score is a weighted sum, and the decision threshold is tuned jointly with the weights.

The non-negativity constraint is enforced via softplus parameterization. The threshold and weights are jointly optimized with Adam using a sigmoid surrogate that makes F0.5 differentiable. After convergence, small weights are pruned and the threshold is re-scanned over sorted posterior scores. The result is a sparse, interpretable ensemble of policies tuned to the quant VC precision-first operating point.
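
A numerical sketch of this weighting step, written in NumPy with a finite-difference gradient standing in for autodiff (the sharpness constant, hyperparameters, and all names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)            # log(1 + e^x), numerically stable

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smoothed_f_beta(params, P, y, beta=0.5, sharp=8.0):
    """Sigmoid-surrogate F_beta: soft thresholding makes it differentiable."""
    m = P.shape[1]
    w = softplus(params[:m])                # non-negative policy weights
    t = sigmoid(params[m])                  # threshold kept inside (0, 1)
    scores = P @ w / w.sum()                # weighted posterior per example
    p_hat = sigmoid(sharp * (scores - t))   # soft "predict positive"
    tp = (p_hat * y).sum()
    fp = (p_hat * (1.0 - y)).sum()
    fn = ((1.0 - p_hat) * y).sum()
    b2 = beta ** 2
    return (1.0 + b2) * tp / ((1.0 + b2) * tp + fp + b2 * fn + 1e-8)

def fit_policy_weights(P, y, steps=300, lr=0.05, eps=1e-4):
    """Jointly optimize weights and threshold with Adam (ascent on F0.5)."""
    params = np.zeros(P.shape[1] + 1)
    m1, v1 = np.zeros_like(params), np.zeros_like(params)
    b1, b2m = 0.9, 0.999
    for step in range(1, steps + 1):
        grad = np.zeros_like(params)
        for i in range(params.size):        # central finite differences
            d = np.zeros_like(params)
            d[i] = eps
            grad[i] = (smoothed_f_beta(params + d, P, y)
                       - smoothed_f_beta(params - d, P, y)) / (2.0 * eps)
        m1 = b1 * m1 + (1.0 - b1) * grad
        v1 = b2m * v1 + (1.0 - b2m) * grad ** 2
        mh = m1 / (1.0 - b1 ** step)
        vh = v1 / (1.0 - b2m ** step)
        params += lr * mh / (np.sqrt(vh) + 1e-8)
    w = softplus(params[:-1])
    return w / w.sum(), sigmoid(params[-1])
```

On a toy probability matrix where one policy is informative and another is noise, the learned weight on the informative policy dominates. After convergence, the paper prunes small weights and re-scans the threshold over sorted posterior scores; that post-processing is omitted here.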

How accurate is Policy Induction on VCBench?

Policy Induction was evaluated on VCBench, the world's first AGI benchmark for venture capital. VCBench contains 9,000 anonymized founder profiles with a 9% success prevalence, partitioned into six folds. The first three folds are used for training and validation; the remaining three are held out for test.

On the held-out test folds, Policy Induction reports the following results:

  • Gemini-2.5-Flash with Policy Induction: Precision 41.0%, Recall 20.2%, F0.5 = 34.0%
  • GPT-5 with Policy Induction: Precision 67.0%, Recall 7.2%, F0.5 = 24.8%

Against the same benchmark, vanilla prompting without Policy Induction (tested on six folds, averaged) produces substantially weaker results:

  • GPT-4o: Precision 29.1%, Recall 16.2%, F0.5 = 25.1%
  • GPT-4o-mini: 29.5%, 10.1%, F0.5 = 21.2%
  • o3: 42.4%, 7.0%, F0.5 = 20.9%
  • GPT-5: 53.7%, 4.3%, F0.5 = 16.2%
  • Gemini-2.5-Pro: 17.2%, 59.0%, F0.5 = 20.1%
  • Gemini-2.5-Flash: 12.6%, 69.1%, F0.5 = 15.1%

The three abstract-headline claims from the paper:

  • A 7.4x precision lift over random chance. GPT-5 with Policy Induction reaches 67% precision against a 9% base rate.
  • 3x the precision of tier-1 VC firms. Tier-1 VC firms score 23% precision on the same VCBench benchmark; Policy Induction with GPT-5 reaches 67%, a 2.9x lift.
  • Highest F0.5 across all baselines, 35% relative gain. Gemini-2.5-Flash with Policy Induction reaches F0.5 = 34.0%. The best vanilla baseline (GPT-4o) reaches F0.5 = 25.1%. The relative gain is 35%.
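
These figures follow from the standard F_beta formula; a quick check against the paper's reported precision and recall:

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta: beta < 1 weights precision more heavily than recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Gemini-2.5-Flash with Policy Induction: P = 41.0%, R = 20.2%
print(round(f_beta(0.410, 0.202), 3))   # 0.34
# GPT-5 precision lift over the 9% base rate
print(round(0.67 / 0.09, 1))            # 7.4
# Relative F0.5 gain over the best vanilla baseline (GPT-4o, 25.1%)
print(round(0.340 / 0.251 - 1, 2))      # 0.35
```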

The two-configuration result is notable in itself. Gemini-2.5-Flash with Policy Induction optimizes for balanced F0.5 (moderate precision, moderate recall). GPT-5 with Policy Induction maximizes precision at the cost of recall. A quant VC firm can choose between these operating points depending on whether it wants coverage or selectivity.

Current Vela production models built on Policy Induction and its siblings in the Think-Reason-Learn family reach 19% to 38% precision on the real-world 1.9% unicorn base rate, a 10x to 20x lift.

How accurate is Policy Induction on a non-anonymized founder dataset?

Before the VCBench benchmark evaluation, an earlier version of Policy Induction (arXiv:2505.21427) was evaluated on the non-anonymized founder dataset Vela refers to as founder_cleaned_data. The dataset contains 1,022 successful and 9,902 failed US companies founded in or after 2010, with structured features extracted from LinkedIn profiles and Crunchbase entries. Only personally identifiable fields (company names, founder names) were removed to prevent LLM memorization, so the remaining signal is richer than the fully anonymized VCBench.

On four test sets designed to match the real-world US unicorn base rate (~1.96%, 40 successful founders and 2,000 failed ones per set), the best Policy Induction configuration (o3 as policy generator, GPT-4o-mini as scorer) achieves:

  • Mean precision 40.5%, F0.5 = 29.4%, accuracy 97.8%
  • 20x lift over the 1.9% base rate of random chance
  • 7.1x the 5.6% outlier-picking precision of top-tier venture capital firms

On eight balanced test subsets (100 successful / 1,000 failed founders each), the best policy achieves mean precision 46.7% and F0.5 = 31.7%, confirming robustness across data segments.

The richer feature set available in founder_cleaned_data enables a larger multiplicative lift (20x over the 1.9% base rate) than VCBench does (7.4x over the 9% base rate). VCBench serves as the benchmarked, reproducible evaluation where every firm faces the same anonymized inputs. founder_cleaned_data reflects how Policy Induction performs under the full information a quant VC firm actually has access to at screening time. The two evaluations are complementary.

The arXiv version has four authors: Xianling Mu (University of Oxford), Joseph Ternasky, Fuat Alican, Yigit Ihlamur (Vela Research). The IEEE extension adds Rick Chen (University of Oxford), multi-policy weighting, and the VCBench evaluation.

Why the policy-in-prompt design matters for quant VC

Most LLM-based quant VC classifiers treat the model as a black box. The prompt contains the task description, the profile, and nothing else. The reasoning is implicit. The “policy” is whatever pattern the LLM happens to have internalized from pre-training, invisible to anyone trying to audit it.

Policy Induction inverts this. The decision logic is externalized into a structured natural-language document embedded in the prompt. Every rule the model will apply is written down, legible to any reader with a high-school education. A partner at a quant VC firm can read the policy before it is deployed, flag rules that encode bias or poor judgment, and rewrite them on the spot. When the partnership's investment thesis changes, the policy is edited directly. No retraining, no engineering ticket, no model redeployment.

This is the architecture most closely aligned with how quant VC firms actually operate. Investment committees approve policies in natural language today. Policy Induction makes those policies executable.

Why Policy Induction is the most deployable quant VC architecture

Policy Induction is the only paper in Vela's Think-Reason-Learn family that uses no gradient-based training on the LLM. The only learned parameters are the non-negative weights across policies, and those are trained by Adam on a sigmoid-surrogate objective in seconds. Everything else is in-context learning.

The practical consequences matter for any quant VC firm thinking about adoption. No GPU cluster is required. No fine-tuned model weights need to be stored, versioned, or audited. No model-retraining pipeline needs to be built. A new policy can be deployed by editing a text file. A new case can be added to the memory by appending to a list. The operational cost of the system is the cost of LLM API calls at inference time, which the paper notes runs with “low compute cost and short iteration cycles.”

This is the architecture a small quant VC firm can run without hiring an ML team. It is also the architecture that scales cleanly as LLMs improve: when a better frontier model ships, the policy is reused unchanged, and performance improves for free.

What makes Policy Induction auditable for quant VC decisions

Three layers of the system are human-inspectable, and every one of them is editable without retraining.

The policy itself. A plain-language document, structured into rule categories. Experts can prune rules, reorder them, merge duplicates, add new ones, or impose formatting constraints. The example in the paper shows categories like “Industry and Market Scalability” and “Relevant and Sustained High-Growth Experience,” each with a rule stated in natural language. If a partner wants to remove a rule that privileges “prestige signals,” they remove it.

The memory of past cases. The refinement loop operates over a history of labeled cases. A quant VC firm can inspect this memory, add cases that matter for their investment thesis, remove cases that are ambiguous or off-thesis, and control which examples shape the policy going forward.

The policy weights. When multiple policies are combined, the non-negative weights are learned but transparent. A partnership can see exactly how much each policy contributes to a given prediction and which policies were pruned as uninformative. No weight can be negative, so no rule can silently cancel another. This is an explicit design choice for auditability: “conflicting rules cannot cancel each other with negative weights.”

The paper's summary of what this gives a quant VC firm: “All model logic resides in human-readable policies that experts can inspect, audit, and edit.”

How Policy Induction fits into Vela's quant VC research program

Policy Induction is the fourth architecture in Vela's Think-Reason-Learn research program. Each attacks the quant VC prediction problem from a different methodological angle:

  • GPTree, the foundational paper, which introduced LLM-powered decision trees and established that LLMs could generate the structure of a decision system.
  • Random Rule Forest (RRF), which treats LLMs as feature generators and aggregates their YES/NO outputs through a deliberately simple unit-weight vote.
  • Reasoned Rule Mining (RRM), which adds Bayesian calibration, log-odds fusion, and selective cascade inference.
  • Policy Induction, this paper, which externalizes the entire decision logic into an editable natural-language policy and uses in-context learning for adaptation.

Across the four papers, the methodological trend is toward progressively more accessible systems. GPTree required a tree structure. RRF required a ranked ensemble. RRM required calibration machinery. Policy Induction requires only a prompt. For a quant VC firm deciding which architecture to adopt, Policy Induction is the lowest-barrier entry point, and the one that most directly places the model's logic under the control of the investment committee.

All four architectures are implemented as modules inside Think-Reason-Learn, the open-source framework Vela built to generalize these methods beyond venture capital. The natural-language policy format of Policy Induction is especially transferable: the same approach applies to grant evaluation, hiring, and case triage, any domain where structured reasoning must be both data-efficient and auditable.

Limitations

The paper identifies three limitations.

Dataset noise. VCBench is standardized but derived from public data, which carries extraction noise and historical biases in what information has been collected about which founders.

Calibration drift across model versions. LLM probability calibration shifts when the underlying model is upgraded. A policy tuned for GPT-4o may behave differently with GPT-5. The paper recommends lightweight recalibration and periodic threshold re-scans.

Policies can encode historical bias. Even when written in natural language and edited by experts, policies induced from historical data can reproduce the biases of the historical decision-makers. Interpretability surfaces this problem but does not eliminate it. Expert review remains the primary safeguard.

Read the papers

Policy Induction: Predicting Startup Success via Explainable, Memory-Augmented In-Context Learning.
Xianling Mu, Joe Ternasky, Fuat Alican, Rick Chen, Yigit Ihlamur.
IEEE CSCloud’25, New York: ieeexplore.ieee.org/document/11261495.

Earlier version, evaluated on Vela's non-anonymized founder dataset:
Policy Induction: Predicting Startup Success via Explainable Memory-Augmented In-Context Learning.
Xianling Mu, Joseph Ternasky, Fuat Alican, Yigit Ihlamur.
arXiv:2505.21427, June 2025.

Policy Induction is part of Vela's quant VC research program, Think-Reason-Learn. Evaluated on VCBench. Related papers: GPTree, Random Rule Forest, Reasoned Rule Mining.

Authored by members of the Vela team. See the full roster of contributors.

For research collaboration in quant VC, in-context learning, explainable AI, or memory-augmented reasoning, email engage@vela.partners.
