Rare-Event Prediction: LLM Feature Engineering and Ensemble Learning for Quant VC

Paper
From Limited Data to Rare-event Prediction: LLM-powered Feature Engineering and Multi-model Learning in Venture Capital.
Authors
Mihir Kumar (University of Oxford), Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Afriyie Kwesi Samuel, Fuat Alican, Yigit Ihlamur (Vela Research).
Venue
arXiv preprint, September 2025.
Status
Preprint. arXiv:2509.08140.
Research program
Part of the LLM-Augmented ML line of Vela's quant VC research, which embeds LLMs as components inside classical ML pipelines rather than as standalone reasoners.

What this paper contributes to quant VC

The paper introduces a multi-model framework that uses LLMs to engineer features from unstructured founder data, feeds those features into an ensemble of classical ML models (XGBoost, Random Forest, and a Linear Regression meta-model), and produces first a continuous funding forecast and then a binary success prediction. The central design thesis: LLMs are most useful to quant VC not as decision-makers, but as rich feature extractors that surface signals (skill relevance, domain expertise, founder-idea fit) that traditional pipelines cannot encode.

On a 10,825-founder dataset with an 8.5 percent baseline success rate, the model reaches 10.3x baseline precision overall and up to 11.1x on one of three held-out test subsets, while maintaining 36 percent recall. Ablation analysis shows LLM-derived features drive most of the lift: removing them drops precision from 10.4x to 4.6x, more than halving the model's precision.

What is quant VC, and where does this paper fit?

Quant VC is the application of quantitative, reproducible, empirically validated methods to venture capital decision-making. It treats venture capital as a rare-event prediction problem that can be modeled, measured, and improved with the same rigor that quantitative finance brings to credit risk or quantitative medicine brings to diagnostic screening, and it demands quantitative scoring against honest baselines, reproducible methodology, and interpretability that allows every prediction to be audited.

This paper sits in the LLM-Augmented ML strand of Vela's quant VC research, alongside GPT-HTree (hierarchical clustering plus LLM personas), LLM-AR (neural-symbolic reasoning with ProbLog), and verifiable reasoning (LLMs as code generators). This strand wraps LLMs inside classical ML pipelines rather than making the LLM the final decision-maker. In this paper specifically, the LLM's job is feature engineering. The prediction itself is produced by standard XGBoost, Random Forest, and Linear Regression components, which means the quant VC outputs are as reproducible as any traditional ML pipeline.

How does the pipeline forecast founder success?

The pipeline has three stages: LLM feature engineering, multi-model learning of funding, and threshold-based conversion to binary success.

LLM feature engineering. Starting from raw LinkedIn and Crunchbase data, the pipeline uses LLMs to generate 63 trainable features, following the approach of Ozince and Ihlamur (2024). Features are organized into categorical, textual, continuous, and boolean groups. Some features are mechanical (education level, number of founders), but the most valuable ones require LLM reasoning: for example, Domain Expertise (encoded as 0 to 3 for No / Weak / Moderate / Strong alignment between founder experience and startup domain) and Skill Relevance (0 to 4 for the fit between founder skills and the startup's technical problem). These are features that a traditional pipeline cannot produce because they require natural-language reasoning over unstructured career data.
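The feature-extraction step can be sketched as prompt construction plus strict parsing of the LLM's answer onto the paper's ordinal scale. The prompt wording and the `build_prompt` / `parse_domain_expertise` helpers below are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of LLM feature engineering for the Domain Expertise feature.
# The 0-3 No/Weak/Moderate/Strong scale is the paper's; everything else
# (prompt text, helper names, error handling) is assumed for illustration.
DOMAIN_EXPERTISE_SCALE = {"no": 0, "weak": 1, "moderate": 2, "strong": 3}

def build_prompt(founder_bio: str, startup_domain: str) -> str:
    """Ask the LLM to rate founder-domain alignment as one of four words."""
    return (
        "Rate the alignment between this founder's experience and the "
        f"startup's domain ({startup_domain}) with one word: "
        "No, Weak, Moderate, or Strong.\n\n"
        f"Founder background: {founder_bio}"
    )

def parse_domain_expertise(llm_response: str) -> int:
    """Map the LLM's one-word answer onto the 0-3 Domain Expertise feature."""
    word = llm_response.strip().split()[0].strip(".,").lower()
    if word not in DOMAIN_EXPERTISE_SCALE:
        raise ValueError(f"unparseable rating: {llm_response!r}")
    return DOMAIN_EXPERTISE_SCALE[word]
```

Strict parsing with an explicit failure path matters here because a malformed LLM answer silently coerced to a default score would corrupt the training data.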

Multi-model learning. The first layer combines XGBoost and Random Forest, chosen for complementary strengths. XGBoost handles high-dimensional multi-category data well. Random Forest is more robust to overfitting and provides stronger interpretability. Their outputs, along with ada-002 text embeddings of the startup description, feed a Linear Regression meta-model that produces a continuous estimate of total funding. Funding prediction error is low, with mean absolute percentage error below 4 percent across all test subsets (3.32, 3.02, and 3.89 percent).
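A minimal sketch of the two-layer stack, with sklearn's GradientBoostingRegressor standing in for XGBoost, random vectors standing in for the ada-002 embeddings, and a synthetic log-funding target. Out-of-fold base predictions are one standard way to keep the meta-model from overfitting its inputs; the paper does not specify its exact stacking protocol:

```python
# Two-layer stack: base regressors -> Linear Regression meta-model.
# Data, embeddings, and the GradientBoosting stand-in are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 63))           # 63 engineered founder features
emb = rng.normal(size=(400, 8))          # stand-in for ada-002 embeddings
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)  # synthetic target

gb = GradientBoostingRegressor(random_state=0)          # stand-in for XGBoost
rf = RandomForestRegressor(n_estimators=100, random_state=0)

# Out-of-fold predictions avoid leaking the target into the meta-model.
gb_oof = cross_val_predict(gb, X, y, cv=5)
rf_oof = cross_val_predict(rf, X, y, cv=5)

meta_X = np.column_stack([gb_oof, rf_oof, emb])         # base outputs + embeddings
meta = LinearRegression().fit(meta_X, y)                # continuous funding estimate
print(f"meta-model R^2: {meta.score(meta_X, y):.2f}")
```

The linear meta-model keeps the combination step transparent: its learned coefficients show exactly how much each base model and the embedding features contribute.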

Threshold-based binary conversion. The continuous funding estimate is mapped to a success probability through logistic regression, then thresholded at 0.8 (higher than the conventional 0.5) to reflect the asymmetric cost of false positives in venture investing. This two-step design also supports segmentation: the continuous funding output lets startups be placed into funding classes with calibrated success probabilities (1.27 percent at $100K to $1M predicted funding, 8.41 percent at $1M to $10M, 80.89 percent at $10M to $100M, 95.35 percent at $100M to $1B, 100 percent at $1B+).
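The conversion step might look like the following sketch: a one-feature logistic regression from the (log) predicted funding to a success probability, thresholded at 0.8 rather than 0.5. The synthetic training data and the log10 parameterization are assumptions for illustration:

```python
# Threshold-based binary conversion: funding estimate -> probability -> label.
# Labels and the log10 funding scale below are synthetic illustrations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
log_funding = rng.uniform(5, 10, size=500).reshape(-1, 1)   # log10 of predicted funding
# Synthetic labels: success becomes likely above roughly $500M (log10 ~ 8.7).
success = (log_funding.ravel() + rng.normal(scale=0.3, size=500) > 8.7).astype(int)

clf = LogisticRegression().fit(log_funding, success)

def predict_success(pred_log_funding: float, threshold: float = 0.8) -> bool:
    """Flag success only when the calibrated probability clears 0.8."""
    p = clf.predict_proba([[pred_log_funding]])[0, 1]
    return bool(p >= threshold)
```

Raising the threshold to 0.8 trades recall for precision, which matches the asymmetric cost structure the paper describes: a false positive consumes a scarce investment slot.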

How accurate is the model?

The model was trained on 8,659 founders and evaluated on three disjoint held-out test subsets of 722 founders each, with baseline success rates of 7.9, 6.8, and 9.3 percent reflecting realistic class imbalance. Success is defined at the standard Vela threshold: the startup reached a $500M+ IPO, $500M+ acquisition, or raised more than $500M. Unsuccessful means the startup raised between $100K and $4M.
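Those outcome definitions reduce to a simple labeling rule. The function below is a sketch (the field names are assumptions); outcomes between the two bands fall outside the paper's label set:

```python
# Success/unsuccessful labeling per the paper's thresholds.
# Field names are hypothetical; the dollar thresholds are the paper's.
def label_outcome(ipo_usd: float, acquisition_usd: float, raised_usd: float):
    """Return 1 for success, 0 for unsuccessful, None if neither band applies."""
    if ipo_usd >= 500e6 or acquisition_usd >= 500e6 or raised_usd > 500e6:
        return 1                      # $500M+ IPO, acquisition, or funding
    if 100e3 <= raised_usd <= 4e6:
        return 0                      # raised between $100K and $4M
    return None                       # between the bands: outside both labels
```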

Results per test subset:

  • Subset 1: 10.4x precision over baseline, 36 percent recall.
  • Subset 2: 11.1x precision, 35 percent recall.
  • Subset 3: 9.8x precision, 38 percent recall.
  • Overall: 10.3x precision, 36 percent recall.
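The "x over baseline" figures are precision expressed as a multiple of the subset's base success rate. The counts in the example below are hypothetical:

```python
# Precision lift over baseline, as reported in the results above.
def precision_lift(tp: int, fp: int, base_rate: float) -> float:
    """Precision of flagged founders divided by the subset's base success rate."""
    precision = tp / (tp + fp)
    return precision / base_rate

# Hypothetical subset with a 7.9% base rate where 4 of 5 flagged founders
# succeed: precision 0.8 / 0.079, roughly a 10x lift.
lift = precision_lift(tp=4, fp=1, base_rate=0.079)
```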

Ablations confirm where the lift comes from:

  • Without LLM-engineered features (38 deterministic features only): 4.6x precision at 26 percent recall. More than half of the model's precision comes from LLM features.
  • Without ada-002 embeddings: 8.7x precision (versus 10.4x with).
  • Without XGBoost: precision drops by 3.2x, recall by 7 points.
  • Without Random Forest: precision drops by 2.3x, recall by 5 points.
  • Without categorical features (mostly LLM-derived): precision drops by 3.9x. This is the largest single-category drop, confirming that LLM-powered categorical features do the heaviest predictive work.

Vela's full production quant VC stack, across the Think-Reason-Learn family and related research, reaches 19 to 38 percent precision when scaled to the 1.9 percent real-world base rate, a 10x to 20x lift over the index. This paper's 10.3x overall and 11.1x peak sit inside that program, contributing a reusable multi-model architecture for combining LLM-engineered features with classical ML.

Why LLM-powered feature engineering matters for quant VC

A founder's LinkedIn profile contains signals that no structured field can capture on its own: whether a CTO's prior roles actually match the startup's technical stack, whether an MBA's consulting background translates to operational execution, whether a serial founder's prior exits were substantive or cosmetic. These are the features that separate strong from weak founders, and they require reasoning over unstructured text.

Before LLMs, quant VC pipelines either ignored these signals or encoded them through brittle keyword matching. This paper's approach replaces the brittle step with an LLM that reads the profile and emits a calibrated score. The 4.6x-versus-10.4x gap in the ablation quantifies the impact. The LLM is doing what it is good at (reading and scoring) and leaving prediction to models that are good at prediction.

What makes this architecture auditable for quant VC decisions

Every prediction decomposes into three inspectable artifacts: the 63 engineered features with their values for this specific founder, the individual XGBoost and Random Forest contributions visible through feature sensitivity analysis, and the meta-model weights that combined them. Feature sensitivity surfaces a consistent hierarchy: category list (15.6 percent of predictive weight) is the single strongest predictor, followed by number of founders, then skill relevance, domain expertise, and education level at much smaller weights. A partner reviewing a prediction can see which features drove it, edit an LLM-derived feature value that disagrees with their own judgment, and rerun the model. The reasoning lives in the feature vector, not in the LLM's hidden state.
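The override-and-rerun workflow can be sketched as below. The stub pipeline is a stand-in for the trained ensemble, and every weight except category list's 15.6 percent is invented for illustration:

```python
# Audit loop sketch: inspect the feature vector, override an LLM-derived
# score, rerun the frozen pipeline. StubPipeline is a stand-in; only the
# 0.156 category_list weight comes from the paper's sensitivity analysis.
class StubPipeline:
    WEIGHTS = {"category_list": 0.156, "num_founders": 0.08,
               "skill_relevance": 0.05, "domain_expertise": 0.04}

    def predict(self, features: dict) -> float:
        return sum(w * features[k] for k, w in self.WEIGHTS.items())

def rerun_with_override(pipeline, features: dict, name: str, value) -> tuple:
    """Return (original score, score after a partner's manual override)."""
    return pipeline.predict(features), pipeline.predict({**features, name: value})

founder = {"category_list": 2, "num_founders": 2,
           "skill_relevance": 1, "domain_expertise": 3}
# A partner who judges Skill Relevance to be 4 rather than 1 edits and reruns:
before, after = rerun_with_override(StubPipeline(), founder, "skill_relevance", 4)
```

Because the reasoning lives in the feature vector, the override is a one-line edit to inputs rather than a renegotiation with the LLM.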

How this paper fits into Vela's quant VC research program

The paper connects to the broader Vela quant VC research program along four axes:

  • Same family: This paper, GPT-HTree, LLM-AR, and verifiable reasoning all treat the LLM as a component inside a classical ML pipeline rather than as the final decision-maker.
  • Adjacent via method: The paper explicitly extends GPTree (Xiong et al., 2024, 9.4x precision), Founder-GPT (Xiong and Ihlamur, 2023, source of the founder-idea fit signal), and Random Rule Forest (Griffin et al., 2025, 5.4x precision), all from the Think-Reason-Learn family. The distinct move here is to redirect the LLM from generating decision rules toward generating features. Related TRL work includes Reasoned Rule Mining and Policy Induction.
  • Adjacent via pattern: The multi-agent SSFF paper orchestrates LLMs as reasoning agents. This paper uses a single LLM as a feature engineer.
  • Benchmarking: Descendants of this method and related approaches are evaluated on VCBench, the public benchmark for quant VC.

Limitations

The paper is explicit about what it does not yet resolve. The layered design (continuous funding predictor followed by logistic regression) introduces error propagation: errors in the funding estimate, though small (below 4 percent MAPE), carry through into the classification probability. The feature pipeline runs through an LLM, so the feature space inherits whatever misclassification risk the underlying LLM carries, especially for subjective features like Skill Relevance. Founder profiles are built from publicly available sources (LinkedIn, Crunchbase), which embeds coverage biases toward founders with greater online presence. LLM hallucinations during feature engineering are a data-quality concern, and detecting and reducing them is flagged as future work.

Read the paper

From Limited Data to Rare-event Prediction: LLM-powered Feature Engineering and Multi-model Learning in Venture Capital.
Mihir Kumar, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Afriyie Kwesi Samuel, Fuat Alican, Yigit Ihlamur.
arXiv preprint, September 2025.
arXiv:2509.08140.

This paper is part of the LLM-Augmented ML family of Vela's quant VC research. For related work in the same family, see GPT-HTree, LLM-AR, and verifiable reasoning. For the adjacent rule-induction line, see GPTree, Random Rule Forest, Reasoned Rule Mining, and Policy Induction, all part of the Think-Reason-Learn family. For the multi-agent line, see Founder-GPT and SSFF. For the benchmark that measures progress across all three families, see VCBench.

Authored by members of the Vela team. See the full roster of contributors.

For research collaboration in quant VC, LLM-powered feature engineering, and multi-model ensembles for rare-event prediction, email engage@vela.partners.
