Random Rule Forest: Interpretable Ensembles for Quant VC

Paper
Random Rule Forest (RRF): Interpretable and Manageable Ensembles of LLM-Generated Questions for Predicting Success from Unstructured Data.
Authors
Ben Griffin (University of Oxford), Ugur Koyluoglu (Oliver Wyman), Diego Vidaurre Henche (University of Oxford), Joseph Ternasky (Vela Research), Fuat Alican (Vela Research), Aaron Ontoyin Yin (Vela Research), Yigit Ihlamur (Vela Research).
Venue
ECML PKDD submission, 2026.
Preprint
arXiv:2505.24622
Status
Provisional patent. Code and anonymized founder dataset publicly released.
Research program
Part of Think-Reason-Learn at Vela, the quant VC research program.

What Random Rule Forest contributes to quant VC

Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. Random Rule Forest (RRF) is one of the core prediction architectures behind Vela's quant VC program, developed by Vela Research in collaboration with the University of Oxford and Oliver Wyman. It operationalizes a specific claim about how quant VC should work: a large language model is most useful as a feature generator, not as an end-to-end judge, and a deliberately simple voting rule on top of those features yields high-precision, fully auditable investment screening under the extreme class imbalance that defines venture capital.

RRF achieves 13.1% precision on unicorn identification at a 1.9% base rate, a 6.9x lift over random chance and well above what the best human venture capital firms achieve. With a small set of expert-informed questions added to the LLM-generated pool, precision rises to 15.3%. These are headline quant VC results on a dataset of 9,892 US founders, evaluated with 10-fold nested cross-validation repeated 10 times.

What is quant VC, and how does RRF exemplify it?

Quant VC treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved the way quantitative finance treats loan portfolios or quantitative hiring treats candidate evaluation. The defining characteristics of a quant VC approach are: quantitative scoring rather than qualitative consensus; reproducible methodology rather than case-by-case judgment; honest baselines rather than cherry-picked wins; and interpretability that allows every prediction to be audited by a partner, a founder, or an LP.

RRF exemplifies all four. It produces a numerical precision score benchmarked against a 1.9% base rate. Its methodology is fully specified in a public paper with open code. Its baselines include three LLM prompting strategies, a learned-weight elastic net, Y Combinator, and nine tier-1 VC firms. And every RRF prediction decomposes into a short list of YES/NO questions that a human can read top to bottom. This is what quant VC looks like when it is actually built, rather than asserted.

How does RRF predict startup success?

RRF is a seven-stage pipeline. An LLM generates a large pool of natural-language YES/NO questions from labeled founder profiles (“Has the founder previously raised a large institutional financing round?”; “Is the founder's university ranked among the top 50 globally?”). Each question is evaluated on every founder, producing a binary response matrix. Questions whose answer vectors are nearly identical to an already-kept question (small Hamming distance) are filtered out as predictively redundant. The remaining questions are ranked by individual F0.5 score, and the top N are combined via a unit-weight threshold rule: if at least T of the N questions are answered YES for a given founder, the founder is classified as likely to succeed. N and T are tuned via nested cross-validation.
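The core mechanics of the pipeline can be sketched in a few functions. This is a minimal illustration, not the paper's released code: the function names, the redundancy cutoff, and the tiny answer matrices are all assumptions made for the example.

```python
def hamming(a, b):
    """Fraction of founders on which two questions' YES/NO answers differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def f_beta(preds, labels, beta=0.5):
    """F-beta on binary predictions; beta=0.5 weights precision over recall."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * prec * rec / (b2 * prec + rec)

def filter_redundant(answers, min_dist=0.05):
    """Drop questions whose answer vector nearly duplicates a kept question."""
    kept = []
    for q, vec in answers.items():
        if all(hamming(vec, answers[k]) >= min_dist for k in kept):
            kept.append(q)
    return kept

def rrf_predict(answer_rows, threshold):
    """Unit-weight vote: YES on at least `threshold` questions => success."""
    return [sum(row) >= threshold for row in answer_rows]
```

In the full pipeline, `f_beta` ranks the surviving questions on held-out folds, and the best (N, T) pair for `rrf_predict` is chosen inside nested cross-validation rather than fixed by hand.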

The paper uses GPT-4o-mini as the backend LLM. Question generation runs once on a held-out design set of 500 founders. All model selection and evaluation happens on the remaining 9,392 founders. The modal ensemble size is 7 to 8 questions with a voting threshold of 5 to 6, a classifier that a partner at a quant VC firm can literally read in under a minute.

How accurate is RRF in quant VC terms?

RRF was evaluated on 9,892 US-founded companies from 2010 to 2016 that raised $100K to $4M in seed funding. Success is defined as an IPO, acquisition, or funding round of at least $500M. The base rate is 1.9%, matching the real-world US unicorn formation rate.

On this dataset, RRF achieves the following precision results:

  • Indexing strategy (random selection): 1.9%
  • Y Combinator: 3.2%
  • Nine tier-1 VC firms (Khosla, Sequoia, Mayfield, Foundation, Benchmark, a16z, Kleiner Perkins, Accel, GV): 5.6%
  • Best LLM prompting baseline (o3 few-shot): 7.7%
  • RRF (LLM-only): 13.1%, a 6.9x lift
  • RRF (LLM + expert-informed questions): 15.3%, an 8x lift
  • RRF (LLM + expert, peak precision across folds): 17.6%

On F0.5, RRF scores 0.118 versus 0.088 for the best prompting baseline, a 34% relative improvement that is statistically significant at p < 0.001. RRF optimizes for F0.5 rather than F1 because in quant VC, the cost of funding a bad founder is much higher than the cost of missing a good one, and F0.5 weights precision twice as heavily as recall.
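The asymmetry F0.5 encodes can be shown with a worked example. The numbers below are illustrative, not from the paper: two hypothetical screeners with mirrored precision/recall profiles tie under F1 but diverge sharply under F0.5.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A picky screener (few bad picks, many misses) vs. a greedy one (the reverse).
picky_f05  = f_beta(precision=0.80, recall=0.20, beta=0.5)   # 0.50
greedy_f05 = f_beta(precision=0.20, recall=0.80, beta=0.5)   # ~0.24
picky_f1   = f_beta(precision=0.80, recall=0.20, beta=1.0)   # 0.32
greedy_f1  = f_beta(precision=0.20, recall=0.80, beta=1.0)   # 0.32
```

F1 cannot distinguish the two screeners; F0.5 prefers the picky one by more than 2x, which matches the stated cost structure of funding a bad founder versus missing a good one.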

Current Vela quant VC production models built on RRF and its siblings in the Think-Reason-Learn family reach 19% to 38% precision on the same scaled real-world basis, a 10x to 20x lift over the US unicorn base rate.

Why RRF works as a quant VC architecture

RRF's performance rests on a specific empirical hypothesis that the paper tests directly: LLMs are most valuable as feature generators under extreme class imbalance, not as end-to-end predictors. Two baselines test the hypothesis from opposite sides.

The first baseline is direct LLM prompting (zero- and few-shot). These approaches ask the LLM to make a single-shot success judgment from raw founder text. They underperform RRF on precision and produce erratic precision-recall trade-offs that cannot be tuned to a specific operating point. In quant VC terms, an unaided LLM is too noisy to use for allocation.

The second baseline is a learned-weight elastic net logistic regression trained on the same binary question-response matrix RRF uses. This baseline tests whether the gain comes from the LLM-generated features alone. It does not: elastic net achieves F0.5 of 0.080, comparable to the prompting baselines and well below RRF.

The gain therefore comes from the combination: LLMs generate interpretable, conceptually diverse binary features, and a deliberately simple unit-weight voting rule aggregates them. Under extreme class imbalance, estimating non-uniform weights is high-variance, and unit weights act as regularization. This echoes classical results on “improper linear models” (Dawes 1979, Einhorn and Hogarth 1975): in noisy, low-data regimes, simple tallying rules often beat estimated coefficients.

This is a quant VC result, but it is also a general result about LLMs in high-stakes prediction.
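The regularization intuition behind unit weights can be illustrated with a small synthetic simulation. Everything here is an assumption made for the sketch: the YES probabilities, the eight equally informative questions, and the naive per-question weight estimator stand in for the real data and the elastic net.

```python
import random

random.seed(0)

def sample(n_pos, n_neg, p_pos=0.6, p_neg=0.3, k=8):
    """Synthetic answers: each of k questions is YES with prob p_pos for
    successes and p_neg for failures, so all k are equally informative."""
    X = [[int(random.random() < p_pos) for _ in range(k)] for _ in range(n_pos)]
    X += [[int(random.random() < p_neg) for _ in range(k)] for _ in range(n_neg)]
    y = [1] * n_pos + [0] * n_neg
    return X, y

def naive_weights(X, y):
    """Per-question weight = P(YES | success) - P(YES | failure)."""
    pos = [row for row, lab in zip(X, y) if lab]
    neg = [row for row, lab in zip(X, y) if not lab]
    return [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(len(X[0]))]

# With only 10 positives (the rare-event regime), estimated weights scatter
# widely even though every question deserves the same weight; unit weights
# are exactly right by construction.
spreads = []
for _ in range(200):
    X, y = sample(n_pos=10, n_neg=500)
    w = naive_weights(X, y)
    spreads.append(max(w) - min(w))
avg_spread = sum(spreads) / len(spreads)
```

The true spread between the best and worst question weight is zero, yet the estimated spread is large on every resample: exactly the estimation variance that unit-weight tallying removes.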

What makes RRF auditable for quant VC decision-making

Every RRF prediction decomposes into a short, human-readable list of questions and answers. A founder classified as likely to succeed has answered YES to at least T of N questions. A partner at a quant VC firm can read each question, see each answer, and agree or disagree with the prediction on the specific grounds the model used. There is no latent space to interpret, no post-hoc explanation layer, and no rationalization.

The ensemble is small enough to audit completely. In the paper's modal configuration, the classifier is 7 to 8 questions with a threshold of 5 to 6. A quant VC partnership can adopt this scorecard, use it, and defend it to an LP committee without invoking an external interpretability tool.
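What "audit completely" means in practice can be shown as a rendered scorecard. The first two questions below are the paper's own examples; the remaining five, the founder name, and the report format are hypothetical fillers for illustration.

```python
# Illustrative scorecard in the paper's modal shape: 7 questions, threshold 5.
SCORECARD = [
    "Has the founder previously raised a large institutional financing round?",
    "Is the founder's university ranked among the top 50 globally?",
    "Has the founder previously founded a venture-backed company?",
    "Does the founder have 5+ years of experience in the startup's domain?",
    "Has the founder held a senior role at a high-growth technology company?",
    "Does the founding team include a technical co-founder?",
    "Has the founder published or patented work relevant to the product?",
]
THRESHOLD = 5

def audit(founder_name, answers):
    """Render the full decision as a human-readable report."""
    lines = [f"Founder: {founder_name}"]
    for q, a in zip(SCORECARD, answers):
        lines.append(f"  [{'YES' if a else ' NO'}] {q}")
    yes = sum(answers)
    verdict = "LIKELY SUCCESS" if yes >= THRESHOLD else "PASS"
    lines.append(f"  {yes}/{len(SCORECARD)} YES (threshold {THRESHOLD}) -> {verdict}")
    return "\n".join(lines)
```

The entire model state is the question list and the threshold; editing the scorecard (the expert-in-the-loop extension below) is a one-line change.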

The expert-in-the-loop extension makes the model directly editable. Domain experts can add their own questions to the candidate pool, and those expert-written questions are scored, ranked, and filtered by the same procedure as LLM-generated ones. Experts can also audit the ranked list and remove questions they consider misaligned with the firm's investment criteria. In the paper, expert questions often rank highly on predictive performance and contribute directly to the final ensemble, raising precision from 13.1% to 15.3%. Quant VC is strongest when it combines quantitative methods with domain knowledge; RRF's architecture builds that combination in.

Does RRF generalize beyond quant VC?

Yes. The same pipeline was evaluated on the Trial Outcome Prediction (TOP) Phase I benchmark for clinical trial success prediction (Fu et al., 2022). On the standard time-based split, RRF achieves PR-AUC 0.638 and ROC-AUC 0.596, outperforming all published baselines including HINT (PR-AUC 0.567), COMPOSE (0.564), DeepEnroll (0.568), and FFNN (0.547).

This is a direct demonstration that an architecture designed for quant VC transfers to high-stakes prediction in other text-heavy expert domains. The same four components generalize: LLM as feature generator, binary question representation, predictive redundancy filtering, and threshold-based voting. Any domain with rare events, unstructured text, and a need for auditable predictions, including grant review, legal case triage, hiring, and medical screening, is a candidate.

How RRF fits into Vela's quant VC research program

RRF is one of four core architectures in Vela's Think-Reason-Learn research program. The others are:

  • GPTree, the foundational paper, which introduced LLM-powered decision trees and established that LLMs could generate the structure of a decision system.
  • Reasoned Rule Mining (RRM), which adds Bayesian calibration and log-odds fusion to the rule-based approach; it won the Commendation Award at ICIM 2026, Oxford.
  • Policy Induction, which moves the reasoning into editable natural-language policies embedded in prompts. Reaches 20x precision over random chance on the same founder evaluation task.

GPTree uses hierarchical trees. RRF uses flat ensembles. RRM adds probabilistic calibration. Policy Induction uses in-context memory. Each attacks the quant VC prediction problem from a different architectural angle, and each contributes different strengths to the production models that Vela deploys.

All four papers are implemented as modules inside Think-Reason-Learn, the open-source framework Vela built to generalize these architectures beyond venture capital.

Limitations

The paper identifies four limitations. First, question generation is stochastic: reruns with different prompts, temperatures, or seeds can produce different question pools. Second, while the final RRF predictor is interpretable, the LLM question-generation step is not; the paper does not characterize how prompt choices systematically shape the induced rule set. Third, learned signals may shift across time, geography, or data-collection pipelines and require re-validation on the target distribution. Fourth, RRF is evaluated as a screening task and does not model downstream constraints such as deal access, allocation limits, or portfolio-level trade-offs; reported metrics should be interpreted as quant VC screening performance, not end-to-end fund return.

Read the paper

Random Rule Forest (RRF): Interpretable and Manageable Ensembles of LLM-Generated Questions for Predicting Success from Unstructured Data.
Ben Griffin, Ugur Koyluoglu, Diego Vidaurre Henche, Joseph Ternasky, Fuat Alican, Aaron Ontoyin Yin, Yigit Ihlamur.
ECML PKDD 2026 submission. Provisional patent filed.

Preprint: arXiv:2505.24622.

Code and anonymized founder dataset: github.com/rrfanon2026/rrf_submission.

RRF is part of Vela's quant VC research program, Think-Reason-Learn. Related papers: GPTree, Reasoned Rule Mining, Policy Induction.

Authored by members of the Vela team. See the full roster of contributors.

For research collaboration in quant VC, expert decision systems, or interpretable machine learning, email engage@vela.partners.
