Rare-Event Prediction: LLM Feature Engineering and Ensemble Learning for Quant VC

Paper
From Limited Data to Rare-event Prediction: LLM-powered Feature Engineering and Multi-model Learning in Venture Capital.
Authors
Mihir Kumar (University of Oxford), Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Afriyie Kwesi Samuel, Fuat Alican, Yigit Ihlamur (Vela Research).
Venue
arXiv preprint, September 2025.
Status
Preprint. arXiv:2509.08140.
Research program
Part of the LLM-Augmented ML line of Vela's quant VC research, which embeds LLMs as components inside classical ML pipelines rather than as standalone reasoners.

What this paper contributes to quant VC

The paper introduces a multi-model framework that uses LLMs to engineer features from unstructured founder data, feeds those features into an ensemble of classical ML models (XGBoost, Random Forest, and a Linear Regression meta-model), and produces first a continuous funding forecast and then a binary success prediction. The central design thesis: LLMs are most useful to quant VC not as decision-makers, but as rich feature extractors that surface signals (skill relevance, domain expertise, founder-idea fit) that traditional pipelines cannot encode.

On a 10,825-founder dataset with an 8.5 percent baseline success rate, the model reaches 10.3x baseline precision overall and up to 11.1x on one of three held-out test subsets, while maintaining 36 percent recall. Ablation analysis shows LLM-derived features drive most of the lift: removing them drops precision from 10.4x to 4.6x, more than halving the model's precision.

What is quant VC, and where does this paper fit?

Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. It treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved with the same rigor that quantitative finance brings to credit risk or quantitative medicine brings to diagnostic screening, and it demands quantitative scoring against honest baselines, reproducible methodology, and interpretability that allows every prediction to be audited.

This paper sits in the LLM-Augmented ML strand of Vela's quant VC research, alongside GPT-HTree (hierarchical clustering plus LLM personas), LLM-AR (neural-symbolic reasoning with ProbLog), and verifiable reasoning (LLMs as code generators). This strand wraps LLMs inside classical ML pipelines rather than making the LLM the final decision-maker. In this paper specifically, the LLM's job is feature engineering. The prediction itself is produced by standard XGBoost, Random Forest, and Linear Regression components, which means the quant VC outputs are as reproducible as any traditional ML pipeline.

How does the pipeline forecast founder success?

The pipeline has three stages: LLM feature engineering, multi-model learning of funding, and threshold-based conversion to binary success.

LLM feature engineering. Starting from raw LinkedIn and Crunchbase data, the pipeline uses LLMs to generate 63 trainable features, following the approach of Ozince and Ihlamur (2024). Features are organized into categorical, textual, continuous, and boolean groups. Some features are mechanical (education level, number of founders), but the most valuable ones require LLM reasoning: for example, Domain Expertise (encoded as 0 to 3 for No / Weak / Moderate / Strong alignment between founder experience and startup domain) and Skill Relevance (0 to 4 for the fit between founder skills and the startup's technical problem). These are features that a traditional pipeline cannot produce because they require natural-language reasoning over unstructured career data.
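The feature-extraction step can be sketched as prompt construction plus strict parsing of the LLM's answer onto the paper's ordinal scale. The prompt wording and the `build_prompt` / `parse_domain_expertise` helpers below are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of LLM feature engineering for the Domain Expertise feature.
# The 0-3 No/Weak/Moderate/Strong scale is the paper's; everything else
# (prompt text, helper names, error handling) is assumed for illustration.
DOMAIN_EXPERTISE_SCALE = {"no": 0, "weak": 1, "moderate": 2, "strong": 3}

def build_prompt(founder_bio: str, startup_domain: str) -> str:
    """Ask the LLM to rate founder-domain alignment as one of four words."""
    return (
        "Rate the alignment between this founder's experience and the "
        f"startup's domain ({startup_domain}) with one word: "
        "No, Weak, Moderate, or Strong.\n\n"
        f"Founder background: {founder_bio}"
    )

def parse_domain_expertise(llm_response: str) -> int:
    """Map the LLM's one-word answer onto the 0-3 Domain Expertise feature."""
    word = llm_response.strip().split()[0].strip(".,").lower()
    if word not in DOMAIN_EXPERTISE_SCALE:
        raise ValueError(f"unparseable rating: {llm_response!r}")
    return DOMAIN_EXPERTISE_SCALE[word]
```

Strict parsing with an explicit failure path matters here because a malformed LLM answer silently coerced to a default score would corrupt the training data.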

Multi-model learning. The first layer combines XGBoost and Random Forest, chosen for complementary strengths. XGBoost handles high-dimensional multi-category data well. Random Forest is more robust to overfitting and provides stronger interpretability. Their outputs, along with ada-002 text embeddings of the startup description, feed a Linear Regression meta-model that produces a continuous estimate of total funding. Funding prediction error is low, with mean absolute percentage error below 4 percent across all test subsets (3.32, 3.02, and 3.89 percent).
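A minimal sketch of the two-layer stack, with sklearn's GradientBoostingRegressor standing in for XGBoost, random vectors standing in for the ada-002 embeddings, and a synthetic log-funding target. Out-of-fold base predictions are one standard way to keep the meta-model from overfitting its inputs; the paper does not specify its exact stacking protocol:

```python
# Two-layer stack: base regressors -> Linear Regression meta-model.
# Data, embeddings, and the GradientBoosting stand-in are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 63))           # 63 engineered founder features
emb = rng.normal(size=(400, 8))          # stand-in for ada-002 embeddings
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)  # synthetic target

gb = GradientBoostingRegressor(random_state=0)          # stand-in for XGBoost
rf = RandomForestRegressor(n_estimators=100, random_state=0)

# Out-of-fold predictions avoid leaking the target into the meta-model.
gb_oof = cross_val_predict(gb, X, y, cv=5)
rf_oof = cross_val_predict(rf, X, y, cv=5)

meta_X = np.column_stack([gb_oof, rf_oof, emb])         # base outputs + embeddings
meta = LinearRegression().fit(meta_X, y)                # continuous funding estimate
print(f"meta-model R^2: {meta.score(meta_X, y):.2f}")
```

The linear meta-model keeps the combination step transparent: its learned coefficients show exactly how much each base model and the embedding features contribute.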

Threshold-based binary conversion. The continuous funding estimate is mapped to a success probability through logistic regression, then thresholded at 0.8 (higher than the conventional 0.5) to reflect the asymmetric cost of false positives in venture investing. This two-step design also supports segmentation: the continuous funding output lets startups be placed into funding classes with calibrated success probabilities (1.27 percent at $100K to $1M predicted funding, 8.41 percent at $1M to $10M, 80.89 percent at $10M to $100M, 95.35 percent at $100M to $1B, 100 percent at $1B+).
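The conversion step might look like the following sketch: a one-feature logistic regression from the (log) predicted funding to a success probability, thresholded at 0.8 rather than 0.5. The synthetic training data and the log10 parameterization are assumptions for illustration:

```python
# Threshold-based binary conversion: funding estimate -> probability -> label.
# Labels and the log10 funding scale below are synthetic illustrations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
log_funding = rng.uniform(5, 10, size=500).reshape(-1, 1)   # log10 of predicted funding
# Synthetic labels: success becomes likely above roughly $500M (log10 ~ 8.7).
success = (log_funding.ravel() + rng.normal(scale=0.3, size=500) > 8.7).astype(int)

clf = LogisticRegression().fit(log_funding, success)

def predict_success(pred_log_funding: float, threshold: float = 0.8) -> bool:
    """Flag success only when the calibrated probability clears 0.8."""
    p = clf.predict_proba([[pred_log_funding]])[0, 1]
    return bool(p >= threshold)
```

Raising the threshold to 0.8 trades recall for precision, which matches the asymmetric cost structure the paper describes: a false positive consumes a scarce investment slot.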

How accurate is the model?

The model was trained on 8,659 founders and evaluated on three disjoint held-out test subsets of 722 founders each, with baseline success rates of 7.9, 6.8, and 9.3 percent reflecting realistic class imbalance. Success is defined at the standard Vela threshold: the startup reached a $500M+ IPO, $500M+ acquisition, or raised more than $500M. Unsuccessful means the startup raised between $100K and $4M.
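Those outcome definitions reduce to a simple labeling rule. The function below is a sketch (the field names are assumptions); outcomes between the two bands fall outside the paper's label set:

```python
# Success/unsuccessful labeling per the paper's thresholds.
# Field names are hypothetical; the dollar thresholds are the paper's.
def label_outcome(ipo_usd: float, acquisition_usd: float, raised_usd: float):
    """Return 1 for success, 0 for unsuccessful, None if neither band applies."""
    if ipo_usd >= 500e6 or acquisition_usd >= 500e6 or raised_usd > 500e6:
        return 1                      # $500M+ IPO, acquisition, or funding
    if 100e3 <= raised_usd <= 4e6:
        return 0                      # raised between $100K and $4M
    return None                       # between the bands: outside both labels
```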

Results per test subset:

  • Subset 1: 10.4x precision over baseline, 36 percent recall.
  • Subset 2: 11.1x precision, 35 percent recall.
  • Subset 3: 9.8x precision, 38 percent recall.
  • Overall: 10.3x precision, 36 percent recall.
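The "x over baseline" figures are precision expressed as a multiple of the subset's base success rate. The counts in the example below are hypothetical:

```python
# Precision lift over baseline, as reported in the results above.
def precision_lift(tp: int, fp: int, base_rate: float) -> float:
    """Precision of flagged founders divided by the subset's base success rate."""
    precision = tp / (tp + fp)
    return precision / base_rate

# Hypothetical subset with a 7.9% base rate where 4 of 5 flagged founders
# succeed: precision 0.8 / 0.079, roughly a 10x lift.
lift = precision_lift(tp=4, fp=1, base_rate=0.079)
```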

Ablations confirm where the lift comes from:

  • Without LLM-engineered features (38 deterministic features only): 4.6x precision at 26 percent recall. More than half of the model's precision comes from LLM features.
  • Without ada-002 embeddings: 8.7x precision (versus 10.4x with).
  • Without XGBoost: precision drops by 3.2x, recall by 7 points.
  • Without Random Forest: precision drops by 2.3x, recall by 5 points.
  • Without categorical features (mostly LLM-derived): precision drops by 3.9x. This is the largest single-category drop, confirming that LLM-powered categorical features do the heaviest predictive work.

Vela's full production quant VC stack, across the Think-Reason-Learn family and related research, reaches 19 to 38 percent precision when scaled to the 1.9 percent real-world base rate, a 10x to 20x lift over the index. This paper's 10.3x overall and 11.1x peak sit inside that program, contributing a reusable multi-model architecture for combining LLM-engineered features with classical ML.

Why LLM-powered feature engineering matters for quant VC

A founder's LinkedIn profile contains signals that no structured field can capture on its own: whether a CTO's prior roles actually match the startup's technical stack, whether an MBA's consulting background translates to operational execution, whether a serial founder's prior exits were substantive or cosmetic. These are the features that separate strong from weak founders, and they require reasoning over unstructured text.

Before LLMs, quant VC pipelines either ignored these signals or encoded them through brittle keyword matching. This paper's approach replaces the brittle step with an LLM that reads the profile and emits a calibrated score. The 4.6x-versus-10.4x gap in the ablation quantifies the impact. The LLM is doing what it is good at (reading and scoring) and leaving prediction to models that are good at prediction.

What makes this architecture auditable for quant VC decisions

Every prediction decomposes into three inspectable artifacts: the 63 engineered features with their values for this specific founder, the individual XGBoost and Random Forest contributions visible through feature sensitivity analysis, and the meta-model weights that combined them. Feature sensitivity surfaces a consistent hierarchy: category list (15.6 percent of predictive weight) is the single strongest predictor, followed by number of founders, then skill relevance, domain expertise, and education level at much smaller weights. A partner reviewing a prediction can see which features drove it, edit an LLM-derived feature value that disagrees with their own judgment, and rerun the model. The reasoning lives in the feature vector, not in the LLM's hidden state.
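The override-and-rerun workflow can be sketched as below. The stub pipeline is a stand-in for the trained ensemble, and every weight except category list's 15.6 percent is invented for illustration:

```python
# Audit loop sketch: inspect the feature vector, override an LLM-derived
# score, rerun the frozen pipeline. StubPipeline is a stand-in; only the
# 0.156 category_list weight comes from the paper's sensitivity analysis.
class StubPipeline:
    WEIGHTS = {"category_list": 0.156, "num_founders": 0.08,
               "skill_relevance": 0.05, "domain_expertise": 0.04}

    def predict(self, features: dict) -> float:
        return sum(w * features[k] for k, w in self.WEIGHTS.items())

def rerun_with_override(pipeline, features: dict, name: str, value) -> tuple:
    """Return (original score, score after a partner's manual override)."""
    return pipeline.predict(features), pipeline.predict({**features, name: value})

founder = {"category_list": 2, "num_founders": 2,
           "skill_relevance": 1, "domain_expertise": 3}
# A partner who judges Skill Relevance to be 4 rather than 1 edits and reruns:
before, after = rerun_with_override(StubPipeline(), founder, "skill_relevance", 4)
```

Because the reasoning lives in the feature vector, the override is a one-line edit to inputs rather than a renegotiation with the LLM.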

How this paper fits into Vela's quant VC research program

The paper connects to the broader Vela quant VC research program along four axes:

  • Same family: This paper, GPT-HTree, LLM-AR, and verifiable reasoning all treat the LLM as a component inside a classical ML pipeline rather than as the final decision-maker.
  • Adjacent via method: The paper explicitly extends GPTree (Xiong et al., 2024, 9.4x precision), Founder-GPT (Xiong and Ihlamur, 2023, source of the founder-idea fit signal), and Random Rule Forest (Griffin et al., 2025, 5.4x precision), all from the Think-Reason-Learn family. The distinct move here is to redirect the LLM from generating decision rules toward generating features. Related TRL work includes Reasoned Rule Mining and Policy Induction.
  • Adjacent via pattern: The multi-agent SSFF paper orchestrates LLMs as reasoning agents. This paper uses a single LLM as a feature engineer.
  • Benchmarking: Descendants of this method and related approaches are evaluated on VCBench, the public benchmark for quant VC.

Limitations

The paper is explicit about what it does not yet resolve. The layered design (continuous funding predictor followed by logistic regression) introduces error propagation: errors in the funding estimate, though small (below 4 percent MAPE), carry through into the classification probability. The feature pipeline runs through an LLM, so the feature space inherits whatever misclassification risk the underlying LLM carries, especially for subjective features like Skill Relevance. Founder profiles are built from publicly available sources (LinkedIn, Crunchbase), which embeds coverage biases toward founders with greater online presence. LLM hallucinations during feature engineering are a data-quality concern, and detecting and reducing them is flagged as future work.

Read the paper

From Limited Data to Rare-event Prediction: LLM-powered Feature Engineering and Multi-model Learning in Venture Capital.
Mihir Kumar, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Afriyie Kwesi Samuel, Fuat Alican, Yigit Ihlamur.
arXiv preprint, September 2025.
arXiv:2509.08140.

This paper is part of the LLM-Augmented ML family of Vela's quant VC research. For related work in the same family, see GPT-HTree, LLM-AR, and verifiable reasoning. For the adjacent rule-induction line, see GPTree, Random Rule Forest, Reasoned Rule Mining, and Policy Induction, all part of the Think-Reason-Learn family. For the multi-agent line, see Founder-GPT and SSFF. For the benchmark that measures progress across all three families, see VCBench.

Authored by members of the Vela team. See the full roster of contributors.

For research collaboration in quant VC, LLM-powered feature engineering, and multi-model ensembles for rare-event prediction, email engage@vela.partners.
