Reasoned Rule Mining: Calibrated LLM Classification for Quant VC

Paper
Reasoned Rule Mining: Precision-Optimised Weighted LogProb Classification.
Authors
Jack Preuveneers (University of Oxford, Department of Engineering Science), Yigit Ihlamur (Vela Research).
Venue
IEEE CSCloud / SecureFinAI, 2025 (shorter version). IEEE Xplore.
Status
Commendation Award, ICIM 2026, Oxford (in press). Non-provisional patent in progress.
Research program
Part of Think-Reason-Learn at Vela, the quant VC research program.

What Reasoned Rule Mining contributes to quant VC

Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. Reasoned Rule Mining (RRM) is the architecture in Vela's quant VC program that addresses a specific problem: large language models produce poorly calibrated outputs, and those outputs cannot be trusted as decision signals without principled probability calibration. RRM is a two-stage framework that converts LLM outputs into calibrated, auditable classifiers suitable for the asymmetric cost structure of quant VC screening.

On a reviewed subset of 8,000 founders with 160 positives drawn from a pool of 9,892, RRM achieves 24.5% precision at 15% recall, with F0.5 = 0.217. Against the dataset's 2% prevalence, this is a 12.3x uplift in precision. The compute-aware cascade reduces expensive second-stage evaluations by 91.3%, turning a brute-force LLM classifier into a selective one without sacrificing the precision-first operating point.

What is quant VC, and where does RRM fit?

Quant VC treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved with the same rigor that quantitative finance brings to credit risk or quantitative medicine brings to diagnostic screening. Quant VC requires quantitative scoring against honest baselines, reproducible methodology, and interpretability that allows every prediction to be audited.

RRM sits at the statistical core of Vela's quant VC architecture. GPTree introduced the idea that LLMs could generate the structure of a decision system. Random Rule Forest showed that LLMs are most effective as feature generators combined with simple voting rules. RRM goes further: it treats the conversion of an LLM into a reliable quant VC classifier as fundamentally a calibration problem, and it builds the mathematical machinery (Platt scaling, log-odds fusion, conditional-independence decomposition, threshold optimization) needed to solve it. For quant VC firms that need probability estimates they can defend to a partner meeting or an LP committee, this is the paper that provides the framework.

How does RRM predict founder success?

RRM operates in three stages.

Stage 1: Rule mining from reasoning traces. The LLM is prompted to produce a stepwise analysis for each training example, where each example is a founder profile paired with its true outcome (successful or unsuccessful). Token-level log-probabilities are recorded alongside the reasoning text. A structured, prompt-driven parser then distills each rationale into a compact IF-THEN predicate about the founder's personal attributes. The rules are scored by perplexity, and only rules below the 95th-percentile threshold are retained. This perplexity filter is a principled guard against LLM hallucination: rules the model generated with low confidence are dropped before they can contaminate the decision policy. Surviving rules are deduplicated and consolidated into a structured, human-readable policy document.
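As a concrete illustration, the perplexity filter can be sketched in a few lines of Python. This is a minimal reading of the step, not the paper's implementation; the rule dictionaries, the helper names, and the strict-inequality cutoff are assumptions.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a generated rule from its token log-probabilities:
    exp of the negative mean log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def percentile(values, q):
    """Simple nearest-rank percentile (q in [0, 100])."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(q / 100 * (len(ordered) - 1))))
    return ordered[idx]

def filter_rules(rules):
    """Keep only rules whose perplexity falls below the 95th-percentile
    cutoff; high-perplexity (low-confidence) rules are dropped."""
    scores = [perplexity(r["token_logprobs"]) for r in rules]
    cutoff = percentile(scores, 95)
    return [r for r, s in zip(rules, scores) if s < cutoff]
```

The key design point is that filtering happens before deduplication and policy assembly, so low-confidence generations never reach the decision policy.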

Stage 2: Cascade inference. At prediction time, a lightweight rules-first pass screens most founders using the mined policy. The LLM is run as a best-of-three ensemble with consensus voting and confidence thresholding. Only founders classified as “High” with at least 95% confidence are promoted to the second stage, where a stricter harsh re-evaluation pass produces refined probabilities. This cascade is compute-aware: easy cases resolve cheaply, hard cases get more reasoning budget. In the paper, the cascade reduces Stage 2 calls from 8,000 to 698, a 91.3% compute reduction.
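One plausible reading of the routing logic, sketched in Python. The majority-vote rule, the mean-confidence aggregation, and the return labels are assumptions beyond what the paper states.

```python
def cascade_route(profile, stage1, n_votes=3, conf_threshold=0.95):
    """Stage-1 screen: run the lightweight rules-first classifier as a
    best-of-n ensemble, then promote to the expensive Stage 2 only when
    the consensus label is 'High' and mean confidence clears 95%."""
    votes = [stage1(profile) for _ in range(n_votes)]   # (label, confidence)
    n_high = sum(1 for label, _ in votes if label == "High")
    mean_conf = sum(conf for _, conf in votes) / n_votes
    if n_high * 2 > n_votes and mean_conf >= conf_threshold:
        return "stage2"              # borderline positive: spend reasoning budget
    return "resolved_negative"       # resolved cheaply by the rules-first pass
```

Under this routing, the vast majority of profiles never pay Stage 2 inference cost, which is the source of the reported 91.3% reduction.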

Stage 3: Numerical adjudication. Raw LLM log-probabilities are calibrated to proper posteriors using Platt scaling. Calibrated evidence from both stages is fused on the log-odds scale, which under conditional independence decomposes additively as log[P(y=1|x) / P(y=0|x)] = log(π/(1−π)) + ℓ₁(x) + ℓ₂(x), where π is the class prior and ℓ₁, ℓ₂ are the calibrated log-likelihood-ratio contributions of the two stages. When stage scores are correlated, a meta-logistic combiner (analogous to stacked generalization) learns the stage weights. The final operating threshold is selected by scanning the ranked posteriors and maximizing F0.5.
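The adjudication arithmetic can be made concrete with a small sketch. Assuming Platt parameters a and b have already been fitted on held-out data, each stage's calibrated posterior contributes its log-odds shift away from the prior; the function names are illustrative.

```python
import math

def platt(score, a, b):
    """Platt scaling: map a raw stage score to a calibrated
    posterior via a fitted sigmoid(a * score + b)."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def fuse(prior, p_stage1, p_stage2):
    """Additive log-odds fusion under conditional independence:
    prior log-odds plus each stage's evidence, expressed as the
    shift of its calibrated posterior log-odds away from the prior."""
    z = (logit(prior)
         + (logit(p_stage1) - logit(prior))
         + (logit(p_stage2) - logit(prior)))
    return 1.0 / (1.0 + math.exp(-z))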

The entire framework is domain-agnostic. Quant VC is the motivating use case, but RRM applies to any high-stakes binary screening problem with class imbalance and asymmetric error costs.

How accurate is RRM?

RRM was evaluated on a reviewed subset of 8,000 founders drawn from a pool of 9,892 profiles collected from Crunchbase and LinkedIn. The subset contains 160 labeled successes, for a base rate of approximately 2%, close to the real-world US unicorn formation rate. Success is defined as a $500M or greater IPO, acquisition, or total funding.

On this dataset, RRM reports the following results:

  • Best-threshold mode (recall at least 0.1): Precision = 24.5%, Recall = 15.0%, F0.5 = 0.217. 12.3x uplift over the 2% base rate.
  • Precalibrated mode (out-of-sample): Precision = 17.3%, Recall = 13.2%, F0.5 = 0.162. 8.65x uplift.
  • Single-model LLM baseline (no pipeline): Precision = 4.5%, Recall = 18.0%, F0.5 = 0.053.

The single-model LLM baseline underperforms RRM on every metric that matters for quant VC: 5.4x lower precision, 4x lower F0.5, and no ability to tune the operating point. Precision at the cost of recall is the correct trade-off for a quant VC firm, where the cost of funding a bad founder is much higher than the cost of missing a good one, and RRM's design delivers exactly that.

Current Vela production models built on RRM and its siblings in the Think-Reason-Learn family reach 19% to 38% precision at the same real-world unicorn base rate of roughly 1.9%, a 10x to 20x lift.

Why calibration is the core quant VC problem

Most attempts to use LLMs as quant VC classifiers fail in a specific and predictable way. An LLM's raw output token probabilities are not proper posteriors; they are sensitive to prompt wording, decoding strategy, and class priors the model never saw. A “top-token equals class” heuristic produces classifications that look confident but are systematically miscalibrated. In a rare-event setting like quant VC (2% base rate in this paper), this miscalibration destroys precision at exactly the operating point that matters.

RRM's central insight is that converting an LLM into a reliable quant VC classifier is fundamentally a calibration problem, not a prompt-engineering problem. The framework treats raw LLM log-probabilities as stage-specific scores that must be mapped to proper posteriors through Platt scaling, then fused with other evidence on the log-odds scale under explicit conditional-independence assumptions. Threshold selection becomes an optimization problem with a stated utility function (F0.5), not a guess.

This is the general recipe the paper proposes for turning any generative LLM into a dependable, precision-optimized classifier. It is why RRM generalizes to screening, triage, hiring, and grant review, and it is why the framework has methodological value beyond quant VC itself.
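The threshold-optimization piece of that recipe can be sketched directly: scan every distinct posterior as a candidate cutoff and keep the one that maximises F0.5. The function names are illustrative, not the paper's code.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta=0.5 weights precision twice as heavily as recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(posteriors, labels, beta=0.5):
    """Scan candidate thresholds over the ranked posteriors and return
    the cutoff (and score) maximising F-beta."""
    best_t, best_f = 1.0, -1.0
    for t in sorted(set(posteriors)):
        preds = [p >= t for p in posteriors]
        tp = sum(1 for pr, y in zip(preds, labels) if pr and y)
        fp = sum(1 for pr, y in zip(preds, labels) if pr and not y)
        fn = sum(1 for pr, y in zip(preds, labels) if not pr and y)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = f_beta(prec, rec, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

Swapping beta changes the stated utility function: a triage application that fears misses more than false alarms would scan with beta > 1 instead.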

Why selectivity beats brute force in quant VC inference

A second contribution of RRM is the compute-aware cascade. In a standard LLM classifier, every founder incurs the same inference cost, regardless of how obvious the case is. RRM instead handles the majority of founders with a lightweight rules-first screen and reserves the expensive harsh-evaluation stage for borderline cases whose stage-one confidence falls below 95%.

This matters for quant VC firms operationally. A fund evaluating thousands of founders per year cannot afford to run a full reasoning cascade on every profile, and most founders are clear negatives that can be resolved with a short rule check. In the paper's evaluation, only 698 of 8,000 founders (8.7%) required the expensive Stage 2 harsh re-evaluation. The remaining 91.3% were resolved by the rules-first pass. This is an order-of-magnitude compute reduction with no loss of precision at the chosen operating point.

What makes RRM auditable for quant VC decisions

Every RRM prediction rests on three components, all of which a quant VC partner can inspect and edit.

The rule policy. The mined IF-THEN predicates are human-readable and expressed in terms of the founder's personal attributes. An expert can remove any rule that seems implausible, biased, or unaligned with the firm's investment criteria, and the downstream calibration and fusion steps operate only on the vetted rule set. No opaque proxy can be reintroduced by the statistical layer.

The calibrated probability. Because RRM outputs calibrated posteriors rather than raw LLM confidence, the firm can set a meaningful operating threshold and report calibration curves to LPs. A 25% posterior means a 25% posterior, and a reliability diagram can be produced to confirm it.

The fusion logic. The log-odds fusion step is an additive sum under conditional independence, or a learned stacked combiner when stages are correlated. Either way, the contribution of each stage is explicit and editable. A partnership can inspect how much of a classification is coming from rule-based evidence versus second-stage harsh evaluation, and adjust the weights if the mixture is wrong.

This is quant VC infrastructure that remains legible to non-technical experts. That legibility is the condition Vela's research program treats as non-negotiable.
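The learned branch of the fusion logic amounts to a small logistic regression over the stage log-odds. A minimal sketch with plain gradient descent follows; the learning rate, epoch count, and function name are assumptions, not the paper's configuration.

```python
import math

def fit_meta_combiner(stage_logits, labels, lr=0.1, epochs=2000):
    """Stacked logistic combiner: learn per-stage weights (and a bias)
    over the stage log-odds for when the conditional-independence
    assumption fails. Plain batch gradient descent on log-loss."""
    n_stages = len(stage_logits[0])
    w = [0.0] * n_stages
    b = 0.0
    for _ in range(epochs):
        gw = [0.0] * n_stages
        gb = 0.0
        for x, y in zip(stage_logits, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y                      # gradient of log-loss wrt z
            for i in range(n_stages):
                gw[i] += err * x[i]
            gb += err
        w = [wi - lr * gi / len(labels) for wi, gi in zip(w, gw)]
        b -= lr * gb / len(labels)
    return w, b
```

The fitted weights are exactly the editable quantities described above: a partnership can read off how much each stage contributes and override the mixture if it looks wrong.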

How RRM fits into Vela's quant VC research program

RRM is one of four core architectures in Vela's Think-Reason-Learn research program. Each attacks the quant VC prediction problem from a different methodological angle:

  • GPTree, the foundational paper, which introduced LLM-powered decision trees and established that LLMs could generate the structure of a decision system.
  • Random Rule Forest (RRF), which treats LLMs as feature generators and aggregates their YES/NO outputs through a deliberately simple unit-weight vote.
  • Reasoned Rule Mining (RRM), this paper, which adds Bayesian calibration, log-odds fusion, and selective cascade inference.
  • Policy Induction, which moves the decision logic into editable natural-language policies embedded in prompts.

RRM is the most mathematically principled of the four. It is the paper that answers the question: what would it take to turn an LLM into a quant VC classifier whose outputs you could actually calibrate, defend, and stake LP capital on? The answer: treat it as a calibration problem, do the rule mining carefully, build a cascade, fuse on the odds scale, and optimize the threshold for your actual utility function.

All four architectures are implemented as modules inside Think-Reason-Learn, the open-source framework Vela built to generalize these methods beyond venture capital.

Limitations

The paper identifies three limitations.

Monotonic calibration assumptions. The Platt scaling step assumes a sigmoidal, monotone link between the raw score and the true log-odds. When the true relationship departs from that shape, isotonic regression (which fits an arbitrary monotone map) or constrained splines (which can also accommodate non-monotone links) would be better suited.
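For reference, isotonic regression can be fitted with the pool-adjacent-violators algorithm in a few lines of pure Python; this sketch is illustrative, not the paper's code.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing step function
    mapping raw score to calibrated probability (a nonparametric
    alternative to Platt's sigmoid)."""
    pairs = sorted(zip(scores, labels))
    # Each block: [weight, fitted value, leftmost score in the block].
    merged = []
    for s, y in pairs:
        merged.append([1.0, float(y), s])
        # Pool adjacent blocks while monotonicity is violated.
        while len(merged) > 1 and merged[-2][1] > merged[-1][1]:
            w2, m2, _ = merged.pop()
            w1, m1, s1 = merged.pop()
            merged.append([w1 + w2, (w1 * m1 + w2 * m2) / (w1 + w2), s1])
    def predict(score):
        # Step function: value of the last block whose left edge <= score.
        value = merged[0][1]
        for _, m, left in merged:
            if score >= left:
                value = m
        return value
    return predict
```

Because the fitted map is a step function rather than a parametric sigmoid, it tracks whatever monotone shape the score-to-outcome relationship actually has, at the cost of needing more calibration data.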

Stage dependence. The additive log-odds decomposition is exact only when the two stage scores are conditionally independent given the class. When they are correlated, the meta-logistic combiner compensates, but the correlation structure must be monitored.

Scope. The framework is binary and single-period. Multiclass outcomes and temporal prediction horizons are identified as promising but unaddressed directions.

Read the paper

Reasoned Rule Mining: Precision-Optimised Weighted LogProb Classification.
Jack Preuveneers, Yigit Ihlamur.
IEEE CSCloud / SecureFinAI, 2025 (shorter version): ieeexplore.ieee.org/document/11261481.
Extended version: Commendation Award, ICIM 2026, Oxford (in press).

RRM is part of Vela's quant VC research program, Think-Reason-Learn. Related papers: GPTree, Random Rule Forest, Policy Induction.

Authored by members of the Vela team. See the full roster of contributors.

For research collaboration in quant VC, LLM calibration, interpretable classification, or selective inference, email engage@vela.partners.
